docs (level 101): fix typos, punctuation, formatting (#160)

* docs: formatted for readability
* docs: rephrased and added punctuation
* docs: fix typos, punctuation, formatting
* docs: fix typo and format
* docs: fix caps and formatting
* docs: fix punctuation and formatting
* docs: capitalized SQL commands, fixed punctuation, formatting
* docs: fix punctuation
* docs: fix punctuation and formatting
* docs: fix caps, punctuation and formatting
* docs: fix links, punctuation, formatting
* docs: fix code block formatting
* docs: fix punctuation, indentation and formatting
2026-07-07 18:40:34 +00:00 · 2024-07-28 17:38:19 +05:30
parent bdcc6856ed
commit 4239ecf473
58 changed files with 1522 additions and 1367 deletions
@@ -4,11 +4,11 @@
 Earlier we discussed different ways to collect key metric data points
 from a service and its underlying infrastructure. This data gives us a
 better understanding of how the service is performing. One of the main
-objectives of monitoring is to detect any service degradations early
+objectives of monitoring is to detect any service degradations early 
 (reduce Mean Time To Detect) and notify stakeholders so that the issues
 are either avoided or can be fixed early, thus reducing Mean Time To
 Recover (MTTR). For example, if you are notified when resource usage by
-a service exceeds 90 percent, you can take preventive measures to avoid
+a service exceeds 90%, you can take preventive measures to avoid
 any service breakdown due to a shortage of resources. On the other hand,
 when a service goes down due to an issue, early detection and
 notification of such incidents can help you quickly fix the issue.
@@ -20,10 +20,10 @@ Today most of the monitoring services available provide a mechanism to
 set up alerts on one or a combination of metrics to actively monitor the
 service health. These alerts have a set of defined rules or conditions,
 and when the rule is broken, you are notified. These rules can be as
-simple as notifying when the metric value exceeds n to as complex as a
-week over week (WoW) comparison of standard deviation over a period of
+simple as notifying when the metric value exceeds _n_ to as complex as a
+week-over-week (WoW) comparison of standard deviation over a period of
 time. Monitoring tools notify you about an active alert, and most of
 these tools support instant messaging (IM) platforms, SMS, email, or
 phone calls. Figure 8 shows a sample alert notification received on
-Slack for memory usage exceeding 90 percent of total RAM space on the
+Slack for memory usage exceeding 90% of total RAM space on the
 host.
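
The two kinds of rule described above can be sketched in a few lines. This is an illustrative sketch only, not how any particular monitoring tool evaluates alerts: the function names, the `max_ratio` parameter, and the sample values are all assumptions made for the example.

```python
from statistics import pstdev


def threshold_alert(value, limit):
    """Simple rule: fire when the metric value exceeds the limit n."""
    return value > limit


def wow_stddev_alert(this_week, last_week, max_ratio=1.5):
    """Week-over-week rule (hypothetical): fire when this week's standard
    deviation is more than max_ratio times last week's."""
    previous = pstdev(last_week)
    current = pstdev(this_week)
    if previous == 0:
        # Last week was perfectly flat, so any spread at all is a change.
        return current > 0
    return current / previous > max_ratio


# Memory usage above 90% triggers the simple rule.
memory_alert = threshold_alert(93.0, 90.0)

# Daily latency samples (made-up numbers) for two consecutive weeks.
last_week = [50, 52, 51, 49, 50, 51, 50]
this_week = [50, 70, 30, 65, 35, 50, 50]
volatility_alert = wow_stddev_alert(this_week, last_week)
```

The threshold rule is cheap and easy to reason about. The week-over-week rule needs historical data but can catch a service that has become erratic without ever crossing a fixed limit.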
@@ -5,35 +5,35 @@
 When setting up monitoring for a service, keep the following best
 practices in mind.

-   **Use the right metric type** -- Most of the libraries available
+-   **Use the right metric type**&mdash;Most of the libraries available
     today offer various metric types. Choose the appropriate metric
     type for monitoring your system. Following are the types of
     metrics and their purposes.

-    -   **Gauge --** *Gauge* is a constant type of metric. After the
+    -   **Gauge**&mdash;*Gauge* is a constant type of metric. After the
         metric is initialized, the metric value does not change unless
         you intentionally update it.

-    -   **Timer --** *Timer* measures the time taken to complete a
+    -   **Timer**&mdash;*Timer* measures the time taken to complete a
         task.

-    -   **Counter --** *Counter* counts the number of occurrences of a
+    -   **Counter**&mdash;*Counter* counts the number of occurrences of a
         particular event.

 For more information about these metric types, see [Data
 Types](https://statsd.readthedocs.io/en/v0.5.0/types.html).
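
To make the difference between the three metric types concrete, here is a minimal in-memory sketch. The `InMemoryMetrics` class and the metric names are hypothetical and exist only for this example; a real client library would ship these values to an aggregator instead of keeping them in dictionaries.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class InMemoryMetrics:
    """Hypothetical registry illustrating counter, gauge, and timer semantics."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.gauges = {}
        self.timings = defaultdict(list)

    def incr(self, name, count=1):
        # Counter: accumulates the number of occurrences of an event.
        self.counters[name] += count

    def gauge(self, name, value):
        # Gauge: holds the last value set until it is explicitly updated.
        self.gauges[name] = value

    @contextmanager
    def timer(self, name):
        # Timer: records how long the wrapped task took, in milliseconds.
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.timings[name].append(elapsed_ms)


metrics = InMemoryMetrics()
metrics.incr("requests")
metrics.incr("requests")
metrics.gauge("queue_depth", 7)
with metrics.timer("handler_ms"):
    sum(range(1000))
```

After this runs, the counter holds the number of events seen, the gauge holds the last value set, and the timer holds one duration sample per timed task.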

-   **Avoid over-monitoring** -- Monitoring can be a significant
-     engineering endeavor***.*** Therefore, be sure not to spend too
+-   **Avoid over-monitoring**&mdash;Monitoring can be a significant
+     engineering endeavor. Therefore, be sure not to spend too
     much time and resources on monitoring services, yet make sure all
     important metrics are captured.

-   **Prevent alert fatigue** -- Set alerts for metrics that are
+-   **Prevent alert fatigue**&mdash;Set alerts for metrics that are
     important and actionable. If you receive too many non-critical
     alerts, you might start ignoring alert notifications over time. As
     a result, critical alerts might get overlooked.

-   **Have a runbook for alerts** -- For every alert, make sure you have
+-   **Have a runbook for alerts**&mdash;For every alert, make sure you have
     a document explaining what actions and checks need to be performed
     when the alert fires. This enables any engineer on the team to
     handle the alert and take necessary actions, without any help from
@@ -6,48 +6,48 @@ monitor the system's performance. These tools help you measure and
 understand various subsystem statistics (CPU, memory, network, and so
 on). Let's look at some of the tools that are predominantly used.

-   `ps/top `-- The process status command (ps) displays information
+-   **`ps/top`**: The process status command (`ps`) displays information
     about all the currently running processes in a Linux system. The
-     top command is similar to the ps command, but it periodically
+     `top` command is similar to the `ps` command, but it periodically
     updates the information displayed until the program is terminated.
-     An advanced version of top, called htop, has a more user-friendly
+     An advanced version of `top`, called `htop`, has a more user-friendly
     interface and some additional features. These command-line
     utilities come with options to modify the operation and output of
     the command. Following are some important options supported by the
-     ps command.
+     `ps` command.

-    -   `-p <pid1, pid2,...>` -- Displays information about processes
+    -   `-p <pid1, pid2,...>`: Displays information about processes
         that match the specified process IDs. Similarly, you can use
         `-u <uid>` and `-g <gid>` to display information about
         processes belonging to a specific user or group.

-    -   `-a` -- Displays information about other users' processes, as well
+    -   `-a`: Displays information about other users' processes, as well
         as one's own.

-    -   `-x` -- When displaying processes matched by other options,
+    -   `-x`: When displaying processes matched by other options,
         includes processes that do not have a controlling terminal.

 ![Results of top command](images/image12.png) 
 <p align="center"> Figure 2: Results of top command </p>

-   `ss` -- The socket statistics command (ss) displays information
+-   **`ss`**: The socket statistics command (`ss`) displays information
     about network sockets on the system. This tool is the successor of
     [netstat](https://man7.org/linux/man-pages/man8/netstat.8.html),
     which is deprecated. Following are some command-line options
-     supported by the ss command:
+     supported by the `ss` command:

-    -   `-t` -- Displays the TCP socket. Similarly, `-u` displays UDP
+    -   `-t`: Displays TCP sockets. Similarly, `-u` displays UDP
         sockets, `-x` is for UNIX domain sockets, and so on.

-    -   `-l` -- Displays only listening sockets.
+    -   `-l`: Displays only listening sockets.

-    -   `-n` -- Instructs the command to not resolve service names.
+    -   `-n`: Instructs the command to not resolve service names.
         Instead displays the port numbers.

 ![List of listening sockets on a system](images/image8.png) <p align="center"> Figure
 3: List of listening sockets on a system </p>

-   `free` -- The free command displays memory usage statistics on the
+-   **`free`**: The `free` command displays memory usage statistics on the
     host like available memory, used memory, and free memory. Most often,
     this command is used with the `-h` command-line option, which
     displays the statistics in a human-readable format.
@@ -55,7 +55,7 @@ on). Let's look at some of the tools that are predominantly used.
 ![Memory statistics on a host in human-readable form](images/image6.png) 
 <p align="center"> Figure 4: Memory statistics on a host in human-readable form </p>

-   `df --` The df command displays disk space usage statistics. The
+-   **`df`**: The `df` command displays disk space usage statistics. The
     `-i` command-line option is also often used to display
     [inode](https://en.wikipedia.org/wiki/Inode) usage
     statistics. The `-h` command-line option is used for displaying
@@ -65,12 +65,12 @@ on). Let's look at some of the tools that are predominantly used.
 <p align="center"> Figure 5:
 Disk usage statistics on a system in human-readable form </p>

-   `sar` -- The sar utility monitors various subsystems, such as CPU
+-   **`sar`**: The `sar` utility monitors various subsystems, such as CPU
     and memory, in real time. This data can be stored in a file
     specified with the `-o` option. This tool helps to identify
     anomalies.

-   `iftop` -- The interface top command (`iftop`) displays bandwidth
+-   **`iftop`**: The interface top command (`iftop`) displays bandwidth
     utilization by a host on an interface. This command is often used
     to identify bandwidth usage by active connections. The `-i` option
     specifies which network interface to watch.
@@ -80,22 +80,22 @@ on). Let's look at some of the tools that are predominantly used.
  <p align="center"> Figure 6: Network bandwidth usage by
 active connection on the host </p>

-   `tcpdump` -- The tcpdump command is a network monitoring tool that
+-   **`tcpdump`**: The `tcpdump` command is a network monitoring tool that
     captures network packets flowing over the network and displays a
     description of the captured packets. The following options are
     available:

-    -   `-i <interface>` -- Interface to listen on
+    -   `-i <interface>`: Interface to listen on

-    -   `host <IP/hostname>` -- Filters traffic going to or from the
+    -   `host <IP/hostname>`: Filters traffic going to or from the
         specified host

-    -   `src/dst` -- Displays one-way traffic from the source (src) or to
+    -   `src/dst`: Displays one-way traffic from the source (src) or to
         the destination (dst)

-    -   `port <port number>` -- Filters traffic to or from a particular
+    -   `port <port number>`: Filters traffic to or from a particular
         port

 ![tcpdump of packets on an interface](images/image10.png) 
-<p align="center"> Figure 7: *tcpdump* of packets on *docker0*
+<p align="center"> Figure 7: <code>tcpdump</code> of packets on <code>docker0</code>
 interface on a host </p>
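
The `tcpdump` options listed above combine into a single invocation: `-i` selects the interface, and the `host`, `port`, and `src`/`dst` keywords form a filter expression joined with `and`. The helper below is a sketch that only assembles the argument list and does not run anything. The function name `build_tcpdump_command` and the sample address are assumptions for illustration.

```python
def build_tcpdump_command(interface, host=None, port=None, direction=None):
    """Assemble a tcpdump argument list from the options described above.

    direction may be "src" or "dst" to restrict the filter to one-way
    traffic; None matches traffic in both directions.
    """
    if direction not in (None, "src", "dst"):
        raise ValueError("direction must be 'src', 'dst', or None")

    command = ["tcpdump", "-i", interface]
    prefix = f"{direction} " if direction else ""

    filters = []
    if host is not None:
        filters.append(f"{prefix}host {host}")
    if port is not None:
        filters.append(f"{prefix}port {port}")

    if filters:
        # tcpdump accepts the whole filter expression as one argument.
        command.append(" and ".join(filters))
    return command


# Packets sent from a (made-up) container address, source port 80, on docker0.
example = build_tcpdump_command(
    "docker0", host="172.17.0.2", port=80, direction="src"
)
```

The resulting list can be passed to `subprocess.run` on a host where `tcpdump` is installed and the caller has permission to capture packets.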
@@ -2,13 +2,13 @@

 A robust monitoring and alerting system is necessary for maintaining and
 troubleshooting a system. A dashboard with key metrics can give you an
-overview of service performance, all in one place. Well-defined alerts
+overview of service performance, all in one place. Well-defined alerts 
 (with realistic thresholds and notifications) further enable you to
 quickly identify any anomalies in the service infrastructure and in
 resource saturation. By taking necessary actions, you can avoid any
 service degradations and decrease MTTD for service breakdowns.

-In addition to in-house monitoring, monitoring real user experience can
+In addition to in-house monitoring, monitoring real-user experience can
 help you to understand service performance as perceived by the users.
 Many modules are involved in serving the user, and most of them are out
 of your control. Therefore, you need to have real-user monitoring in
@@ -30,7 +30,6 @@ observability:
     Shkuro](https://learning.oreilly.com/library/view/mastering-distributed-tracing/9781788628464/)


-
 ## References

 -   [Google SRE book: Monitoring Distributed
@@ -76,7 +76,7 @@ a system, analyzing the data to derive meaningful information, and
 displaying the data to the users. In simple terms, you measure various
 metrics regularly to understand the state of the system, including but
 not limited to, user requests, latency, and error rate. *What gets
-measured, gets fixed*---if you can measure something, you can reason
+measured, gets fixed*&mdash;if you can measure something, you can reason
 about it, understand it, discuss it, and act upon it with confidence.


@@ -102,14 +102,14 @@ book](https://sre.google/sre-book/monitoring-distributed-systems/),
 if you can measure only four metrics of your service, focus on these
 four. Let's look at each of the four golden signals.

-   **Traffic** -- *Traffic* gives a better understanding of the service
+-   **Traffic**&mdash;*Traffic* gives a better understanding of the service
     demand. Often referred to as *service QPS* (queries per second),
     traffic is a measure of requests served by the service. This
     signal helps you to decide when a service needs to be scaled up to
     handle increasing customer demand and scaled down to be
     cost-effective.

-   **Latency** -- *Latency* is the measure of time taken by the service
+-   **Latency**&mdash;*Latency* is the measure of time taken by the service
     to process the incoming request and send the response. Measuring
     service latency helps in the early detection of slow degradation
     of the service. Distinguishing between the latency of successful
@@ -121,7 +121,7 @@ four. Let's look at each of the four golden signals.
     HTTP 500 error indicates a failed request, factoring 500s into
     overall latency might result in misleading calculations.

-   **Error (rate)** -- *Error* is the measure of failed client
+-   **Error (rate)**&mdash;*Error* is the measure of failed client
     requests. These failures can be easily identified based on the
     response codes ([HTTP 5XX
     error](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses)).
@@ -136,7 +136,7 @@ four. Let's look at each of the four golden signals.
     [instrumentation](https://en.wikipedia.org/wiki/Instrumentation_(computer_programming)))
     in place to capture errors in addition to the response codes.

-   **Saturation** -- *Saturation* is a measure of the resource
+-   **Saturation**&mdash;*Saturation* is a measure of the resource
     utilization by a service. This signal tells you the state of
     service resources and how full they are. These resources include
     memory, compute, network I/O, and so on. Service performance
@@ -168,17 +168,17 @@ service health. With access to historical data collected over time, you
 can build intelligent applications to address specific needs. Some of
 the key use cases follow:

-   **Reduction in time to resolve issues** -- With a good monitoring
+-   **Reduction in time to resolve issues**&mdash;With a good monitoring
     infrastructure in place, you can identify issues quickly and
     resolve them, which reduces the impact caused by the issues.

-   **Business decisions** -- Data collected over a period of time can
+-   **Business decisions**&mdash;Data collected over a period of time can
     help you make business decisions such as determining the product
     release cycle, which features to invest in, and geographical areas
     to focus on. Decisions based on long-term data can improve the
     overall product experience.

-   **Resource planning** -- By analyzing historical data, you can
+-   **Resource planning**&mdash;By analyzing historical data, you can
     forecast service compute-resource demands, and you can properly
     allocate resources. This allows financially effective decisions,
     with no compromise in end-user experience.
@@ -186,35 +186,35 @@ the key use cases follow:
 Before we dive deeper into monitoring, let's understand some basic
 terminologies.

-   **Metric** -- A metric is a quantitative measure of a particular
-     system attribute---for example, memory or CPU
+-   **Metric**&mdash;A metric is a quantitative measure of a particular
+     system attribute&mdash;for example, memory or CPU

-   **Node or host** -- A physical server, virtual machine, or container
+-   **Node or host**&mdash;A physical server, virtual machine, or container
     where an application is running

-   **QPS** -- *Queries Per Second*, a measure of traffic served by the
+-   **QPS**&mdash;*Queries Per Second*, a measure of traffic served by the
     service per second

-   **Latency** -- The time interval between user action and the
-     response from the server---for example, time spent after sending a
+-   **Latency**&mdash;The time interval between user action and the
+     response from the server&mdash;for example, time spent after sending a
     query to a database before the first response bit is received

-   **Error** **rate** -- Number of errors observed over a particular
+-   **Error rate**&mdash;Number of errors observed over a particular
     time period (usually a second)

-   **Graph** -- In monitoring, a graph is a representation of one or
+-   **Graph**&mdash;In monitoring, a graph is a representation of one or
     more values of metrics collected over time

-   **Dashboard** -- A dashboard is a collection of graphs that provide
+-   **Dashboard**&mdash;A dashboard is a collection of graphs that provide
     an overview of system health

-   **Incident** -- An incident is an event that disrupts the normal
+-   **Incident**&mdash;An incident is an event that disrupts the normal
     operations of a system

-   **MTTD** -- *Mean Time To Detect* is the time interval between the
+-   **MTTD**&mdash;*Mean Time To Detect* is the time interval between the
     beginning of a service failure and the detection of such failure

-   **MTTR** -- Mean Time To Resolve is the time spent to fix a service
+-   **MTTR**&mdash;*Mean Time To Resolve* is the time spent to fix a service
     failure and bring the service back to its normal state
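
As a worked example of the last two terms, MTTD and MTTR can be computed from incident timestamps. The incident records below are made up, and the sketch assumes that time to resolve is measured from detection to resolution; some teams measure it from the start of the failure instead.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log: when each failure began, was detected, and was fixed.
incidents = [
    {
        "started": datetime(2024, 1, 1, 10, 0),
        "detected": datetime(2024, 1, 1, 10, 5),
        "resolved": datetime(2024, 1, 1, 10, 35),
    },
    {
        "started": datetime(2024, 1, 8, 22, 0),
        "detected": datetime(2024, 1, 8, 22, 15),
        "resolved": datetime(2024, 1, 8, 23, 15),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average interval between two timestamps across all records, in minutes."""
    return mean(
        (record[end_key] - record[start_key]).total_seconds() / 60
        for record in records
    )


# MTTD: failure start -> detection. MTTR (assumed): detection -> resolution.
mttd_minutes = mean_minutes(incidents, "started", "detected")
mttr_minutes = mean_minutes(incidents, "detected", "resolved")
```

With these two incidents, detection took 5 and 15 minutes and resolution took 30 and 60 minutes, so the averages are 10 and 45 minutes respectively.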

 Before we discuss monitoring an application, let us look at the
@@ -230,7 +230,7 @@ In addition, a monitoring infrastructure includes alert subsystems for
 notifying concerned parties during any abnormal behavior. Let's look at
 each of these infrastructure components:

-   **Host metrics agent --** A *host metrics agent* is a process
+-   **Host metrics agent**&mdash;A *host metrics agent* is a process
     running on the host that collects performance statistics for host
     subsystems such as memory, CPU, and network. These metrics are
     regularly relayed to a metrics collector for storage and
@@ -239,7 +239,7 @@ each of these infrastructure components:
     [telegraf](https://www.influxdata.com/time-series-platform/telegraf/),
     and [metricbeat](https://www.elastic.co/beats/metricbeat).

-   **Metric aggregator --** A *metric aggregator* is a process running
+-   **Metric aggregator**&mdash;A *metric aggregator* is a process running
     on the host. Applications running on the host collect service
     metrics using
     [instrumentation](https://en.wikipedia.org/wiki/Instrumentation_(computer_programming)).
@@ -249,7 +249,7 @@ each of these infrastructure components:
     collector in batches. An example is
     [StatsD](https://github.com/statsd/statsd).

-   **Metrics collector --** A *metrics collector* process collects all
+-   **Metrics collector**&mdash;A *metrics collector* process collects all
     the metrics from the metric aggregators running on multiple hosts.
     The collector takes care of decoding and stores this data on the
     database. Metric collection and storage might be taken care of by
@@ -258,19 +258,19 @@ each of these infrastructure components:
     next. An example is [carbon
     daemons](https://graphite.readthedocs.io/en/latest/carbon-daemons.html).

-   **Storage --** A time-series database stores all of these metrics.
+-   **Storage**&mdash;A time-series database stores all of these metrics.
     Examples are [OpenTSDB](http://opentsdb.net/),
     [Whisper](https://graphite.readthedocs.io/en/stable/whisper.html),
     and [InfluxDB](https://www.influxdata.com/).

-   **Metrics server --** A *metrics server* can be as basic as a web
+-   **Metrics server**&mdash;A *metrics server* can be as basic as a web
     server that graphically renders metric data. In addition, the
     metrics server provides aggregation functionalities and APIs for
     fetching metric data programmatically. Some examples are
     [Grafana](https://github.com/grafana/grafana) and
     [Graphite-Web](https://github.com/graphite-project/graphite-web).

-   **Alert manager --** The *alert manager* regularly polls metric data
+-   **Alert manager**&mdash;The *alert manager* regularly polls metric data
     available and, if there are any anomalies detected, notifies you.
     Each alert has a set of rules for identifying such anomalies.
     Today many metrics servers such as
@@ -3,7 +3,7 @@
 # Observability

 Engineers often use observability when referring to building reliable
-systems. *Observability* is a term derived from control theory, It is a
+systems. *Observability* is a term derived from control theory; it is a
 measure of how well internal states of a system can be inferred from
 knowledge of its external outputs. Service infrastructures used on a
 daily basis are becoming more and more complex; proactive monitoring
@@ -82,7 +82,7 @@ Figure 10 shows a log processing platform using ELK (Elasticsearch,
 Logstash, Kibana), which provides centralized log processing. Beats is a
 collection of lightweight data shippers that can ship logs, audit data,
 network data, and so on over the network. In this use case specifically,
-we are using filebeat as a log shipper. Filebeat watches service log
+we are using Filebeat as a log shipper. Filebeat watches service log
 files and ships the log data to Logstash. Logstash parses these logs and
 transforms the data, preparing it to store on Elasticsearch. Transformed
 log data is stored on Elasticsearch and indexed for fast retrieval.
@@ -8,13 +8,13 @@ addition, a number of companies such as
 monitoring-as-a-service. In this section, we are not covering
 monitoring-as-a-service in depth.

-In recent years, more and more people have access to the internet. Many
+In recent years, more and more people have access to the Internet. Many
 services are offered online to cater to the increasing user base. As a
 result, web pages are becoming larger, with increased client-side
 scripts. Users want these services to be fast and error-free. From the
 service point of view, when the response body is composed, an HTTP 200
 OK response is sent, and everything looks okay. But there might be
-errors during transmission or on the client side. As previously
+errors during transmission or on the client-side. As previously
 mentioned, monitoring services from within the service infrastructure
 give good visibility into service health, but this is not enough. You
 need to monitor user experience, specifically the availability of
@@ -29,7 +29,7 @@ service is globally accessible. Other third-party monitoring solutions
 for real user monitoring (RUM) provide performance statistics such as
 service uptime and response time, from different geographical locations.
 This allows you to monitor the user experience from these locations,
-which might have different internet backbones, different operating
+which might have different Internet backbones, different operating
 systems, and different browsers and browser versions. [Catchpoint
 Global Monitoring
 Network](https://pages.catchpoint.com/overview-video) is a