Initial commit to metrics and monitoring course
@@ -20,6 +20,7 @@ In this course, we are focusing on building strong foundational skills. The cour
- [NoSQL concepts](https://linkedin.github.io/school-of-sre/databases_nosql/intro/)
- [Big Data](https://linkedin.github.io/school-of-sre/big_data/intro/)
- [Systems Design](https://linkedin.github.io/school-of-sre/systems_design/intro/)
- [Metrics and Monitoring](metrics_and_monitoring/introduction.md)
- [Security](https://linkedin.github.io/school-of-sre/security/intro/)

We believe continuous learning will help in acquiring deeper knowledge and competencies in order to expand your skill sets; every module has added references which could be a guide for further learning. Our hope is that by going through these modules we should be able to build the essential skills required for a Site Reliability Engineer.
courses/metrics_and_monitoring/alerts.md (new file)
@@ -0,0 +1,28 @@

# Proactive monitoring using alerts

Earlier we discussed different ways to collect key metric data points from a service and its underlying infrastructure. This data gives us a better understanding of how the service is performing. One of the main objectives of monitoring is to detect any service degradation early (reduce Mean Time To Detect, or MTTD) and notify stakeholders so that the issues are either avoided or fixed early, thus reducing Mean Time To Recover (MTTR). For example, if you are notified when resource usage by a service exceeds 90 percent, you can take preventive measures to avoid any service breakdown due to a shortage of resources. On the other hand, when a service goes down due to an issue, early detection and notification of such incidents can help you quickly fix the issue.

<p align="center"> Figure 8: An alert notification received on Slack </p>

Today most of the monitoring services available provide a mechanism to set up alerts on one or a combination of metrics to actively monitor the service health. These alerts have a set of defined rules or conditions, and when a rule is broken, you are notified. These rules can be as simple as notifying when a metric value exceeds n, or as complex as a week-over-week (WoW) comparison of standard deviation over a period of time. Monitoring tools notify you about an active alert, and most of these tools support instant messaging (IM) platforms, SMS, email, or phone calls. Figure 8 shows a sample alert notification received on Slack for memory usage exceeding 90 percent of total RAM space on the host.
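The rule-plus-notification pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular monitoring tool; the Slack webhook URL is a placeholder, and the `psutil` and `requests` packages are assumed to be installed.

```python
import psutil
import requests

# Placeholder webhook URL -- replace with your own Slack incoming webhook.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
MEMORY_THRESHOLD_PERCENT = 90


def check_memory_and_alert():
    """Fire a Slack notification when memory usage crosses the threshold."""
    used_percent = psutil.virtual_memory().percent
    if used_percent > MEMORY_THRESHOLD_PERCENT:
        message = (
            f":warning: Memory usage is at {used_percent:.1f}% "
            f"(threshold: {MEMORY_THRESHOLD_PERCENT}%)"
        )
        requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=5)


if __name__ == "__main__":
    check_memory_and_alert()
```

In practice such a check would run periodically (for example, from a scheduler or an alert manager) rather than as a one-off script.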
courses/metrics_and_monitoring/best_practices.md (new file)
@@ -0,0 +1,40 @@

# Best practices for monitoring

When setting up monitoring for a service, keep the following best practices in mind.

- **Use the right metric type** -- Most of the libraries available today offer various metric types. Choose the appropriate metric type for monitoring your system. Following are the types of metrics and their purposes (a short instrumentation sketch follows this list).

    - **Gauge** -- A *gauge* is a constant type of metric. After the metric is initialized, the metric value does not change unless you intentionally update it.

    - **Timer** -- A *timer* measures the time taken to complete a task.

    - **Counter** -- A *counter* counts the number of occurrences of a particular event.

    For more information about these metric types, see [Data Types](https://statsd.readthedocs.io/en/v0.5.0/types.html).

- **Avoid over-monitoring** -- Monitoring can be a significant engineering endeavor. Therefore, be sure not to spend too much time and resources on monitoring services, yet make sure all important metrics are captured.

- **Prevent alert fatigue** -- Set alerts for metrics that are important and actionable. If you receive too many non-critical alerts, you might start ignoring alert notifications over time. As a result, critical alerts might get overlooked.

- **Have a runbook for alerts** -- For every alert, make sure you have a document explaining what actions and checks need to be performed when the alert fires. This enables any engineer on the team to handle the alert and take necessary actions, without any help from others.
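As a hedged illustration of these metric types, the sketch below uses the Python `statsd` client (any StatsD-compatible client looks similar); the metric names, the `payments` prefix, and the localhost aggregator address are assumptions made for the example.

```python
import time

from statsd import StatsClient  # pip install statsd

# Assumes a StatsD-compatible aggregator listening on localhost:8125.
stats = StatsClient(host="localhost", port=8125, prefix="payments")

# Counter: count every occurrence of an event.
stats.incr("requests.processed")

# Timer: measure the time taken to complete a task.
with stats.timer("order.processing_time"):
    time.sleep(0.1)  # stand-in for real work

# Gauge: record a point-in-time value that stays until you update it.
stats.gauge("queue.depth", 42)
```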
courses/metrics_and_monitoring/command-line_tools.md (new file)
@@ -0,0 +1,98 @@

# Command-line tools

Most of the Linux distributions today come with a set of tools that monitor the system's performance. These tools help you measure and understand various subsystem statistics (CPU, memory, network, and so on). Let's look at some of the tools that are predominantly used.

- `ps`/`top` -- The process status command (`ps`) displays information about all the currently running processes in a Linux system. The `top` command is similar to the `ps` command, but it periodically updates the information displayed until the program is terminated. An advanced version of `top`, called `htop`, has a more user-friendly interface and some additional features. These command-line utilities come with options to modify the operation and output of the command. Following are some important options supported by the `ps` command.

    - `-p <pid1, pid2,...>` -- Displays information about processes that match the specified process IDs. Similarly, you can use `-u <uid>` and `-g <gid>` to display information about processes belonging to a specific user or group.

    - `-a` -- Displays information about other users' processes, as well as one's own.

    - `-x` -- When displaying processes matched by other options, includes processes that do not have a controlling terminal.

<p align="center"> Figure 2: Results of top command </p>

- `ss` -- The socket statistics command (`ss`) displays information about network sockets on the system. This tool is the successor of [netstat](https://man7.org/linux/man-pages/man8/netstat.8.html), which is deprecated. Following are some command-line options supported by the `ss` command:

    - `-t` -- Displays TCP sockets. Similarly, `-u` displays UDP sockets, `-x` is for UNIX domain sockets, and so on.

    - `-l` -- Displays only listening sockets.

    - `-n` -- Instructs the command to not resolve service names; displays the port numbers instead.

<p align="center"> Figure 3: List of listening sockets on a system </p>

- `free` -- The `free` command displays memory usage statistics on the host, such as available memory, used memory, and free memory. Most often, this command is used with the `-h` command-line option, which displays the statistics in a human-readable format.

<p align="center"> Figure 4: Memory statistics on a host in human-readable form </p>

- `df` -- The `df` command displays disk space usage statistics. The `-i` command-line option is also often used to display [inode](https://en.wikipedia.org/wiki/Inode) usage statistics. The `-h` command-line option is used for displaying statistics in a human-readable format.

<p align="center"> Figure 5: Disk usage statistics on a system in human-readable form </p>

- `sar` -- The `sar` utility monitors various subsystems, such as CPU and memory, in real time. This data can be stored in a file specified with the `-o` option. This tool helps to identify anomalies.

- `iftop` -- The interface top command (`iftop`) displays bandwidth utilization by a host on an interface. This command is often used to identify bandwidth usage by active connections. The `-i` option specifies which network interface to watch.

<p align="center"> Figure 6: Network bandwidth usage by active connection on the host </p>

- `tcpdump` -- The `tcpdump` command is a network monitoring tool that captures network packets flowing over the network and displays a description of the captured packets. The following options are available:

    - `-i <interface>` -- Interface to listen on

    - `host <IP/hostname>` -- Filters traffic going to or from the specified host

    - `src/dst` -- Displays one-way traffic from the source (src) or to the destination (dst)

    - `port <port number>` -- Filters traffic to or from a particular port

<p align="center"> Figure 7: *tcpdump* of packets on *docker0* interface on a host </p>
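If you need the same host statistics programmatically (for example, to feed a host metrics agent), a library such as `psutil` can report them from Python. This is a hedged sketch, not part of the course's tooling, and it assumes `psutil` is installed.

```python
import psutil

# CPU utilization over a one-second sampling window (similar to what top shows).
cpu_percent = psutil.cpu_percent(interval=1)

# Memory statistics, comparable to the output of `free`.
memory = psutil.virtual_memory()

# Disk usage for the root filesystem, comparable to `df -h /`.
disk = psutil.disk_usage("/")

print(f"CPU usage      : {cpu_percent:.1f}%")
print(f"Memory usage   : {memory.percent:.1f}% of {memory.total // (1024 ** 2)} MiB")
print(f"Disk usage (/) : {disk.percent:.1f}% of {disk.total // (1024 ** 3)} GiB")
```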
courses/metrics_and_monitoring/conclusion.md (new file)
@@ -0,0 +1,50 @@

# Conclusion

A robust monitoring and alerting system is necessary for maintaining and troubleshooting a system. A dashboard with key metrics can give you an overview of service performance, all in one place. Well-defined alerts (with realistic thresholds and notifications) further enable you to quickly identify any anomalies in the service infrastructure and in resource saturation. By taking necessary actions, you can avoid any service degradations and decrease MTTD for service breakdowns.

In addition to in-house monitoring, monitoring real user experience can help you to understand service performance as perceived by the users. Many modules are involved in serving the user, and most of them are out of your control. Therefore, you need to have real-user monitoring in place.

Metrics give very abstract details on service performance. To get a better understanding of the system and for faster recovery during incidents, you might want to implement the other two pillars of observability: logs and tracing. Logs and trace data can help you understand what led to service failure or degradation.

Following are some resources to learn more about monitoring and observability:

- [Google SRE book: Monitoring Distributed Systems](https://sre.google/sre-book/monitoring-distributed-systems/)

- [Mastering Distributed Tracing by Yuri Shkuro](https://learning.oreilly.com/library/view/mastering-distributed-tracing/9781788628464/)

- Engineering blogs on [LinkedIn](https://engineering.linkedin.com/blog/topic/monitoring), [Grafana](https://grafana.com/blog/), [Elastic.co](https://www.elastic.co/blog/), [OpenTelemetry](https://medium.com/opentelemetry)

## References

- [Google SRE book: Monitoring Distributed Systems](https://sre.google/sre-book/monitoring-distributed-systems/)

- [Mastering Distributed Tracing, by Yuri Shkuro](https://learning.oreilly.com/library/view/mastering-distributed-tracing/9781788628464/)

- [Monitoring and Observability](https://copyconstruct.medium.com/monitoring-and-observability-8417d1952e1c)

- [Three Pillars with Zero Answers](https://medium.com/lightstephq/three-pillars-with-zero-answers-2a98b36358b8)
BIN  courses/metrics_and_monitoring/images/image1.jpg   (new file, 39 KiB)
BIN  courses/metrics_and_monitoring/images/image10.png  (new file, 53 KiB)
BIN  courses/metrics_and_monitoring/images/image11.png  (new file, 49 KiB)
BIN  courses/metrics_and_monitoring/images/image12.png  (new file, 126 KiB)
BIN  courses/metrics_and_monitoring/images/image2.png   (new file, 23 KiB)
BIN  courses/metrics_and_monitoring/images/image3.jpg   (new file, 45 KiB)
BIN  courses/metrics_and_monitoring/images/image4.jpg   (new file, 27 KiB)
BIN  courses/metrics_and_monitoring/images/image5.jpg   (new file, 34 KiB)
BIN  courses/metrics_and_monitoring/images/image6.png   (new file, 9.3 KiB)
BIN  courses/metrics_and_monitoring/images/image7.png   (new file, 307 KiB)
BIN  courses/metrics_and_monitoring/images/image8.png   (new file, 22 KiB)
BIN  courses/metrics_and_monitoring/images/image9.png   (new file, 19 KiB)
courses/metrics_and_monitoring/introduction.md (new file)
@@ -0,0 +1,280 @@

# Prerequisites

- [Linux Basics](https://linkedin.github.io/school-of-sre/linux_basics/intro/)

- [Python and the Web](https://linkedin.github.io/school-of-sre/python_web/intro/)

- [Systems Design](https://linkedin.github.io/school-of-sre/systems_design/intro/)

- [Linux Networking Fundamentals](https://linkedin.github.io/school-of-sre/linux_networking/intro/)

## What to expect from this course

Monitoring is an integral part of any system. As an SRE, you need to have a basic understanding of monitoring a service infrastructure. By the end of this course, you will gain a better understanding of the following topics:

- What is monitoring?

- What needs to be measured

- How the metrics gathered can be used to improve business decisions and overall reliability

- Proactive monitoring with alerts

- Log processing and its importance

- What is observability?

    - Distributed tracing

    - Logs

    - Metrics

## What is not covered in this course

- Guide to setting up a monitoring infrastructure

- Deep dive into different monitoring technologies and benchmarking or comparison of any tools

## Course content

- [Introduction](#introduction)

    - [Four golden signals of monitoring](#four-golden-signals-of-monitoring)

    - [Why is monitoring important?](#why-is-monitoring-important)

- [Command-line tools](command-line_tools.md)

- [Third-party monitoring](third-party_monitoring.md)

- [Proactive monitoring using alerts](alerts.md)

- [Best practices for monitoring](best_practices.md)

- [Observability](observability.md)

    - [Logs](observability.md#logs)

    - [Tracing](observability.md#tracing)

- [Conclusion](conclusion.md)

# Introduction

Monitoring is a process of collecting real-time performance metrics from a system, analyzing the data to derive meaningful information, and displaying the data to the users. In simple terms, you measure various metrics regularly to understand the state of the system, including, but not limited to, user requests, latency, and error rate. *What gets measured, gets fixed*---if you can measure something, you can reason about it, understand it, discuss it, and act upon it with confidence.
## Four golden signals of monitoring

When setting up monitoring for a system, you need to decide what to measure. The four golden signals of monitoring provide a good understanding of service performance and lay a foundation for monitoring a system. These four golden signals are

- Traffic

- Latency

- Error

- Saturation

These metrics help you to understand the system performance and bottlenecks, and to create a better end-user experience. As discussed in the [Google SRE book](https://sre.google/sre-book/monitoring-distributed-systems/), if you can measure only four metrics of your service, focus on these four. Let's look at each of the four golden signals.

- **Traffic** -- *Traffic* gives a better understanding of the service demand. Often referred to as *service QPS* (queries per second), traffic is a measure of requests served by the service. This signal helps you to decide when a service needs to be scaled up to handle increasing customer demand and scaled down to be cost-effective.

- **Latency** -- *Latency* is the measure of the time taken by the service to process an incoming request and send the response. Measuring service latency helps in the early detection of slow degradation of the service. Distinguishing between the latency of successful requests and the latency of failed requests is important. For example, an [HTTP 5XX error](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses) triggered due to loss of connection to a database or other critical backend might be served very quickly. However, because an HTTP 500 error indicates a failed request, factoring 500s into overall latency might result in misleading calculations.

- **Error (rate)** -- *Error* is the measure of failed client requests. These failures can be easily identified based on the response codes ([HTTP 5XX errors](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses)). There might be cases where the response is considered erroneous due to wrong result data or due to policy violations. For example, you might get an [HTTP 200](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/200) response, but the body has incomplete data, or the response time is breaching the agreed-upon [SLA](https://en.wikipedia.org/wiki/Service-level_agreement)s. Therefore, you need to have other mechanisms (code logic or [instrumentation](https://en.wikipedia.org/wiki/Instrumentation_(computer_programming))) in place to capture errors in addition to the response codes.

- **Saturation** -- *Saturation* is a measure of the resource utilization by a service. This signal tells you the state of service resources and how full they are. These resources include memory, compute, network I/O, and so on. Service performance slowly degrades even before resource utilization is at 100 percent. Therefore, having a utilization target is important. An increase in latency is a good indicator of saturation; measuring the [99th percentile](https://medium.com/@ankur_anand/an-in-depth-introduction-to-99-percentile-for-programmers-22e83a00caf) of latency can help in the early detection of saturation.

Depending on the type of service, you can measure these signals in different ways. For example, you might measure queries per second served for a web server. In contrast, for a database server, transactions performed and database sessions created give you an idea about the traffic handled by the database server. With the help of additional code logic (monitoring libraries and instrumentation), you can measure these signals periodically and store them for future analysis; a minimal instrumentation sketch follows below. Although these metrics give you an idea about the performance at the service end, you also need to ensure that the same user experience is delivered at the client end. Therefore, you might need to monitor the service from outside the service infrastructure, which is discussed under third-party monitoring.
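As a rough illustration of such code logic, the sketch below wraps a request handler to record three of the golden signals (traffic, latency, errors) in memory; in a real service these values would be pushed to a metric aggregator instead. The handler name and the in-memory storage are assumptions made for the example.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-in for a real metrics backend.
metrics = defaultdict(float)


def instrumented(handler):
    """Record traffic, latency, and error count for a request handler."""
    @wraps(handler)
    def wrapper(*args, **kwargs):
        metrics["requests_total"] += 1          # traffic
        start = time.monotonic()
        try:
            return handler(*args, **kwargs)
        except Exception:
            metrics["errors_total"] += 1        # errors
            raise
        finally:
            metrics["latency_seconds_sum"] += time.monotonic() - start  # latency
    return wrapper


@instrumented
def handle_request(user_id):
    # Hypothetical request handler, used only for this example.
    return {"user_id": user_id, "status": "ok"}


if __name__ == "__main__":
    handle_request(42)
    print(dict(metrics))
```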
## Why is monitoring important?

Monitoring plays a key role in the success of a service. As discussed earlier, monitoring provides performance insights for understanding service health. With access to historical data collected over time, you can build intelligent applications to address specific needs. Some of the key use cases follow:

- **Reduction in time to resolve issues** -- With a good monitoring infrastructure in place, you can identify issues quickly and resolve them, which reduces the impact caused by the issues.

- **Business decisions** -- Data collected over a period of time can help you make business decisions such as determining the product release cycle, which features to invest in, and which geographical areas to focus on. Decisions based on long-term data can improve the overall product experience.

- **Resource planning** -- By analyzing historical data, you can forecast service compute-resource demands, and you can allocate resources properly. This allows financially effective decisions, with no compromise in end-user experience.

Before we dive deeper into monitoring, let's understand some basic terminology.

- **Metric** -- A metric is a quantitative measure of a particular system attribute---for example, memory or CPU

- **Node or host** -- A physical server, virtual machine, or container where an application is running

- **QPS** -- *Queries Per Second*, a measure of traffic served by the service per second

- **Latency** -- The time interval between user action and the response from the server---for example, time spent after sending a query to a database before the first response bit is received

- **Error rate** -- Number of errors observed over a particular time period (usually a second)

- **Graph** -- In monitoring, a graph is a representation of one or more values of metrics collected over time

- **Dashboard** -- A dashboard is a collection of graphs that provide an overview of system health

- **Incident** -- An incident is an event that disrupts the normal operations of a system

- **MTTD** -- *Mean Time To Detect* is the time interval between the beginning of a service failure and the detection of such failure

- **MTTR** -- *Mean Time To Resolve* is the time spent to fix a service failure and bring the service back to its normal state

Before we discuss monitoring an application, let us look at the monitoring infrastructure. Following is an illustration of a basic monitoring system.

<p align="center"> Figure 1: Illustration of a monitoring infrastructure </p>

Figure 1 shows a monitoring infrastructure mechanism for aggregating metrics on the system, and collecting and storing the data for display. In addition, a monitoring infrastructure includes alert subsystems for notifying concerned parties during any abnormal behavior. Let's look at each of these infrastructure components (a small example of emitting a metric to an aggregator follows this list):

- **Host metrics agent** -- A *host metrics agent* is a process running on the host that collects performance statistics for host subsystems such as memory, CPU, and network. These metrics are regularly relayed to a metrics collector for storage and visualization. Some examples are [collectd](https://collectd.org/), [telegraf](https://www.influxdata.com/time-series-platform/telegraf/), and [metricbeat](https://www.elastic.co/beats/metricbeat).

- **Metric aggregator** -- A *metric aggregator* is a process running on the host. Applications running on the host collect service metrics using [instrumentation](https://en.wikipedia.org/wiki/Instrumentation_(computer_programming)). Collected metrics are sent either to the aggregator process or directly to the metrics collector over an API, if available. Received metrics are aggregated periodically and relayed to the metrics collector in batches. An example is [StatsD](https://github.com/statsd/statsd).

- **Metrics collector** -- A *metrics collector* process collects all the metrics from the metric aggregators running on multiple hosts. The collector takes care of decoding and stores this data in the database. Metric collection and storage might be taken care of by one single service such as [InfluxDB](https://www.influxdata.com/), which we discuss next. An example is [carbon daemons](https://graphite.readthedocs.io/en/latest/carbon-daemons.html).

- **Storage** -- A time-series database stores all of these metrics. Examples are [OpenTSDB](http://opentsdb.net/), [Whisper](https://graphite.readthedocs.io/en/stable/whisper.html), and [InfluxDB](https://www.influxdata.com/).

- **Metrics server** -- A *metrics server* can be as basic as a web server that graphically renders metric data. In addition, the metrics server provides aggregation functionalities and APIs for fetching metric data programmatically. Some examples are [Grafana](https://github.com/grafana/grafana) and [Graphite-Web](https://github.com/graphite-project/graphite-web).

- **Alert manager** -- The *alert manager* regularly polls the metric data available and, if any anomalies are detected, notifies you. Each alert has a set of rules for identifying such anomalies. Today many metrics servers such as [Grafana](https://github.com/grafana/grafana) support alert management. We discuss alerting [in detail later](#proactive-monitoring-using-alerts). Examples are [Grafana](https://github.com/grafana/grafana) and [Icinga](https://icinga.com/).
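To make the aggregator's role concrete, here is a hedged sketch of an application pushing metrics to a local StatsD-style aggregator over UDP using the plain-text StatsD line format; the metric names and the localhost:8125 address are assumptions made for the example.

```python
import socket

# StatsD-style aggregators typically listen for plain-text lines over UDP on port 8125.
STATSD_ADDRESS = ("127.0.0.1", 8125)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Line format: "<metric name>:<value>|<type>" -- c = counter, ms = timer, g = gauge.
sock.sendto(b"api.requests:1|c", STATSD_ADDRESS)          # count one request
sock.sendto(b"api.response_time:320|ms", STATSD_ADDRESS)  # 320 ms response time
sock.sendto(b"api.active_sessions:12|g", STATSD_ADDRESS)  # current gauge value
sock.close()
```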
courses/metrics_and_monitoring/observability.md (new file)
@@ -0,0 +1,151 @@

# Observability

Engineers often use observability when referring to building reliable systems. *Observability* is a term derived from control theory; it is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. Service infrastructures used on a daily basis are becoming more and more complex; proactive monitoring alone is not sufficient to quickly resolve issues causing application failures. With monitoring, you can keep known past failures from recurring, but with a complex service architecture, many unknown factors can cause potential problems. To address such cases, you can make the service observable. An observable system provides highly granular insights into the implicit failure modes. In addition, an observable system furnishes ample context about its inner workings, which unlocks the ability to uncover deeper systemic issues.

Monitoring enables failure detection; observability helps in gaining a better understanding of the system. Among engineers, there is a common misconception that monitoring and observability are two different things. Actually, observability is the superset of monitoring; that is, monitoring improves service observability. The goal of observability is not only to detect problems, but also to understand where the issue is and what is causing it. In addition to metrics, observability has two more pillars: logs and traces, as shown in Figure 9. Although these three components do not make a system 100 percent observable, they are the most important and powerful components that give a better understanding of the system. Each of these pillars has its flaws, which are described in [Three Pillars with Zero Answers](https://medium.com/lightstephq/three-pillars-with-zero-answers-2a98b36358b8).

<p align="center"> Figure 9: Three pillars of observability </p>

Because we have covered metrics already, let's look at the other two pillars (logs and traces).

#### Logs

Logs (often referred to as *events*) are a record of activities performed by a service during its run time, with a corresponding timestamp. Metrics give abstract information about degradations in a system, and logs give a detailed view of what is causing these degradations. Logs created by the applications and infrastructure components help in effectively understanding the behavior of the system by providing details on application errors, exceptions, and event timelines. Logs help you to go back in time to understand the events that led to a failure. Therefore, examining logs is essential to troubleshooting system failures.
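To make this concrete, here is a hedged sketch of an application emitting timestamped log events with Python's standard `logging` module; the logger name, file name, and messages are invented for the example, and in practice these lines would be written to a file that a log shipper (such as Filebeat, discussed below) watches.

```python
import logging

# Timestamped log lines written to a file that a log shipper could watch.
logging.basicConfig(
    filename="service.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("url_shortener")

logger.info("request received path=/shorten user_id=42")
try:
    raise ConnectionError("database unreachable")  # simulated failure
except ConnectionError:
    logger.exception("failed to persist short URL")
```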
Log processing involves the aggregation of different logs from individual applications and their subsequent shipment to central storage. Moving logs to central storage helps to preserve the logs in case the application instances are inaccessible, or the application crashes due to a failure. After the logs are available in a central place, you can analyze the logs to derive sensible information from them. For audit and compliance purposes, you archive these logs on the central storage for a certain period of time. Log analyzers fetch useful information from log lines, such as request user information, request URL (feature), response headers (such as content length), and response time. This information is grouped based on these attributes and made available to you through a visualization tool for quick understanding.

You might be wondering how this log information helps. This information gives a holistic view of activities performed on all the involved entities. For example, let's say someone is performing a DoS (denial of service) attack on a web application. With the help of log processing, you can quickly look at top client IPs derived from access logs and identify where the attack is coming from.

Similarly, if a feature in an application is causing a high error rate when accessed with a particular request parameter value, the results of log analysis can help you to quickly identify the misbehaving parameter value and take further action.
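As a small, hedged illustration of the DoS example above, the sketch below counts the most frequent client IPs in an access log; the log path and the assumption that the client IP is the first whitespace-separated field (as in the common/combined log formats used by web servers such as NGINX and Apache) are made up for the example.

```python
from collections import Counter

# Hypothetical access log path; assumes the client IP is the first field on each line.
ACCESS_LOG = "/var/log/nginx/access.log"

ip_counts = Counter()
with open(ACCESS_LOG) as log_file:
    for line in log_file:
        fields = line.split()
        if fields:
            ip_counts[fields[0]] += 1

# Top 10 client IPs by request count -- a sudden outlier here may indicate a DoS source.
for ip, count in ip_counts.most_common(10):
    print(f"{ip:15s} {count}")
```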
 <p align="center"> Figure 10: Log processing and analysis using ELK stack </p>

Figure 10 shows a log processing platform using ELK (Elasticsearch, Logstash, Kibana), which provides centralized log processing. Beats is a collection of lightweight data shippers that can ship logs, audit data, network data, and so on over the network. In this use case specifically, we are using Filebeat as a log shipper. Filebeat watches service log files and ships the log data to Logstash. Logstash parses these logs and transforms the data, preparing it to store on Elasticsearch. Transformed log data is stored on Elasticsearch and indexed for fast retrieval. Kibana searches and displays log data stored on Elasticsearch. Kibana also provides a set of visualizations for graphically displaying summaries derived from log data.

Storing logs is expensive. Extensive logging of every event on the server is costly and takes up more storage space. With an increasing number of services, this cost can increase proportionally to the number of services.
#### Tracing

So far, we have covered the importance of metrics and logging. Metrics give an abstract overview of the system, and logging gives a record of events that occurred. Imagine a complex distributed system with multiple microservices, where a user request is processed by multiple microservices in the system. Metrics and logging give you some information about how these requests are being handled by the system, but they fail to provide detailed information across all the microservices and how they affect a particular client request. If a slow downstream microservice is leading to increased response times, you need detailed visibility across all involved microservices to identify such a microservice. The answer to this need is a request tracing mechanism.

A trace is a series of spans, where each span is a record of events performed by different microservices to serve the client's request. In simple terms, a trace is a log of client-request serving derived from various microservices across different physical machines. Each span includes span metadata such as trace ID and span ID, and context, which includes information about transactions performed.

 <p align="center"> Figure 11: Trace and spans for a URL shortener request </p>

Figure 11 is a graphical representation of a trace captured on the [URL shortener](https://linkedin.github.io/school-of-sre/python_web/url-shorten-app/) example we covered earlier while learning Python.

Similar to monitoring, the tracing infrastructure comprises a few modules for collecting traces, storing them, and accessing them. Each microservice runs a tracing library that collects traces in the background, creates in-memory batches, and submits them to the tracing backend. The tracing backend normalizes the received trace data and stores it on persistent storage. Tracing data comes from multiple different microservices; therefore, trace storage is often organized to store data incrementally and is indexed by trace identifier. This organization helps in the reconstruction of trace data and in visualization. Figure 12 illustrates the anatomy of distributed tracing.

 <p align="center"> Figure 12: Anatomy of distributed tracing </p>

Today a set of tools and frameworks are available for building distributed tracing solutions. Following are some of the popular tools:

- [OpenTelemetry](https://opentelemetry.io/): Observability framework for cloud-native software

- [Jaeger](https://www.jaegertracing.io/): Open-source distributed tracing solution

- [Zipkin](https://zipkin.io/): Open-source distributed tracing solution
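To show what instrumenting code for tracing can look like, here is a hedged sketch using recent versions of the OpenTelemetry Python SDK (assuming the `opentelemetry-api` and `opentelemetry-sdk` packages are installed); the span names loosely mirror the URL shortener example but are otherwise made up, and the console exporter stands in for a real tracing backend such as Jaeger or Zipkin.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console; a real deployment would export to a tracing backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Parent span for the incoming request, with a child span for the lookup it triggers.
with tracer.start_as_current_span("shorten-url-request"):
    with tracer.start_as_current_span("database-lookup"):
        pass  # stand-in for the actual work
```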
courses/metrics_and_monitoring/third-party_monitoring.md (new file)
@@ -0,0 +1,37 @@

# Third-party monitoring

Today most cloud providers offer a variety of monitoring solutions. In addition, a number of companies such as [Datadog](https://www.datadoghq.com/) offer monitoring-as-a-service. In this section, we are not covering monitoring-as-a-service in depth.

In recent years, more and more people have access to the internet. Many services are offered online to cater to the increasing user base. As a result, web pages are becoming larger, with increased client-side scripts. Users want these services to be fast and error-free. From the service point of view, when the response body is composed, an HTTP 200 OK response is sent, and everything looks okay. But there might be errors during transmission or on the client side. As previously mentioned, monitoring services from within the service infrastructure gives good visibility into service health, but this is not enough. You need to monitor the user experience, specifically the availability of services for clients. A number of third-party services, such as [Catchpoint](https://www.catchpoint.com/), [Pingdom](https://www.pingdom.com/), and so on, are available for achieving this goal.

Third-party monitoring services can generate synthetic traffic simulating user requests from various parts of the world, to ensure the service is globally accessible. Other third-party monitoring solutions for real user monitoring (RUM) provide performance statistics, such as service uptime and response time, from different geographical locations. This allows you to monitor the user experience from these locations, which might have different internet backbones, different operating systems, and different browsers and browser versions. [Catchpoint Global Monitoring Network](https://pages.catchpoint.com/overview-video) is a comprehensive 3-minute video that explains the importance of monitoring the client experience.
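A very small, hedged sketch of what a synthetic probe does under the hood: request a URL and record availability and response time. The URL is a placeholder, and real synthetic monitoring runs such probes from many geographies, networks, and browsers rather than a single script.

```python
import time

import requests

# Placeholder endpoint; a real probe would run from multiple geographic locations.
TARGET_URL = "https://example.com/health"

start = time.monotonic()
try:
    response = requests.get(TARGET_URL, timeout=5)
    available = response.status_code == 200
except requests.RequestException:
    available = False
elapsed_ms = (time.monotonic() - start) * 1000

print(f"available={available} response_time_ms={elapsed_ms:.0f}")
```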
@@ -57,6 +57,14 @@ nav:
      - Availability: systems_design/availability.md
      - Fault Tolerance: systems_design/fault-tolerance.md
      - Conclusion: systems_design/conclusion.md
  - Metrics and Monitoring:
      - Introduction: metrics_and_monitoring/introduction.md
      - Command-line Tools: metrics_and_monitoring/command-line_tools.md
      - Third-party Monitoring: metrics_and_monitoring/third-party_monitoring.md
      - Proactive Monitoring with Alerts: metrics_and_monitoring/alerts.md
      - Best Practices for Monitoring: metrics_and_monitoring/best_practices.md
      - Observability: metrics_and_monitoring/observability.md
      - Conclusion: metrics_and_monitoring/conclusion.md
  - Security:
      - Introduction: security/intro.md
      - Fundamentals of Security: security/fundamentals.md