diff --git a/courses/big_data/evolution.md b/courses/big_data/evolution.md deleted file mode 100644 index cb665b7..0000000 --- a/courses/big_data/evolution.md +++ /dev/null @@ -1,84 +0,0 @@ -# Evolution of Hadoop - - - -# Architecture of Hadoop - -1. **HDFS** - 1. The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. - 2. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. - 3. HDFS is part of the [Apache Hadoop Core project](https://github.com/apache/hadoop). - -  - - 1. NameNode: is the arbitrator and central repository of file namespace in the cluster. The NameNode executes the operations such as opening, closing, and renaming files and directories. - 2. DataNode: manages the storage attached to the node on which it runs. It is responsible for serving all the read and writes requests. It performs operations on instructions on NameNode such as creation, deletion, and replications of blocks. - 3. Client: Responsible for getting the required metadata from the namenode and then communicating with the datanodes for reads and writes. - -2. **YARN** - 1. YARN stands for “Yet Another Resource Negotiator“. It was introduced in Hadoop 2.0 to remove the bottleneck on Job Tracker which was present in Hadoop 1.0. YARN was described as a “Redesigned Resource Manager” at the time of its launching, but it has now evolved to be known as a large-scale distributed operating system used for Big Data processing. - 2. The main components of YARN architecture include: - -  - - 1. Client: It submits map-reduce(MR) jobs to the resource manager. - 2. Resource Manager: It is the master daemon of YARN and is responsible for resource assignment and management among all the applications. Whenever it receives a processing request, it forwards it to the corresponding node manager and allocates resources for the completion of the request accordingly. It has two major components: - 3. Scheduler: It performs scheduling based on the allocated application and available resources. It is a pure scheduler, which means that it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as Capacity Scheduler and Fair Scheduler to partition the cluster resources. - 4. Application manager: It is responsible for accepting the application and negotiating the first container from the resource manager. It also restarts the Application Manager container if a task fails. - 5. Node Manager: It takes care of individual nodes on the Hadoop cluster and manages application and workflow and that particular node. Its primary job is to keep up with the Node Manager. It monitors resource usage, performs log management, and also kills a container based on directions from the resource manager. It is also responsible for creating the container process and starting it at the request of the Application master. - 6. Application Master: An application is a single job submitted to a framework. The application manager is responsible for negotiating resources with the resource manager, tracking the status, and monitoring the progress of a single application. 
The application master requests the container from the node manager by sending a Container Launch Context(CLC) which includes everything an application needs to run. Once the application is started, it sends the health report to the resource manager from time-to-time. - 7. Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a single node. The containers are invoked by Container Launch Context(CLC) which is a record that contains information such as environment variables, security tokens, dependencies, etc. - - -# MapReduce framework - - - -1. The term MapReduce represents two separate and distinct tasks Hadoop programs perform-Map Job and Reduce Job. Map jobs take data sets as input and process them to produce key-value pairs. Reduce job takes the output of the Map job i.e. the key-value pairs and aggregates them to produce desired results. -2. Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of large data sets on computing clusters. Mapreduce helps to split the input data set into a number of parts and run a program on all data parts parallel at once. -3. Please find the below Word count example demonstrating the usage of the MapReduce framework: - - - - -# Other tooling around Hadoop - -1. [**Hive**](https://hive.apache.org/) - 1. Uses a language called HQL which is very SQL like. Gives non-programmers the ability to query and analyze data in Hadoop. Is basically an abstraction layer on top of map-reduce. - 2. Ex. HQL query: - 1. _SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);_ - 3. In mysql: - 1. _SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name;_ -2. [**Pig**](https://pig.apache.org/) - 1. Uses a scripting language called Pig Latin, which is more workflow driven. Don't need to be an expert Java programmer but need a few coding skills. Is also an abstraction layer on top of map-reduce. - 2. Here is a quick question for you: - What is the output of running the pig queries in the right column against the data present in the left column in the below image? - -  - - Output: - ``` - 7,Komal,Nayak,24,9848022334,trivendram - 8,Bharathi,Nambiayar,24,9848022333,Chennai - 5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar - 6,Archana,Mishra,23,9848022335,Chennai - ``` - -3. [**Spark**](https://spark.apache.org/) - 1. Spark provides primitives for in-memory cluster computing that allows user programs to load data into a cluster’s memory and query it repeatedly, making it well suited to machine learning algorithms. -4. [**Presto**](https://prestodb.io/) - 1. Presto is a high performance, distributed SQL query engine for Big Data. - 2. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Kafka, and MongoDB. - 3. Example presto query: - ``` - use studentDB; - show tables; - SELECT roll_no, name FROM studentDB.studentDetails where section=’A’ limit 5; - ``` - - -# Data Serialisation and storage - -1. In order to transport the data over the network or to store on some persistent storage, we use the process of translating data structures or objects state into binary or textual form. We call this process serialization.. -2. Avro data is stored in a container file (a .avro file) and its schema (the .avsc file) is stored with the data file. -3. Apache Hive provides support to store a table as Avro and can also query data in this serialisation format. 
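The word-count example referenced above was originally an image. As a stand-in, here is a minimal, self-contained Python sketch of the same map/shuffle/reduce flow; it runs locally without Hadoop, and the function names and sample data are purely illustrative:

```
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: aggregate every count emitted for the same key (word).
    return word, sum(counts)

def word_count(lines):
    grouped = defaultdict(list)          # stands in for the shuffle/sort step
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)
    return dict(reducer(word, counts) for word, counts in grouped.items())

print(word_count(["the quick brown fox", "the lazy dog", "the fox"]))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

On a real cluster the shuffle and sort between the map and reduce phases is handled by the framework across many machines; the sketch only mimics it in memory.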
diff --git a/courses/big_data/images/hadoop_evolution.png b/courses/big_data/images/hadoop_evolution.png deleted file mode 100644 index 849d83c..0000000 Binary files a/courses/big_data/images/hadoop_evolution.png and /dev/null differ diff --git a/courses/big_data/images/hdfs_architecture.png b/courses/big_data/images/hdfs_architecture.png deleted file mode 100644 index 63d356c..0000000 Binary files a/courses/big_data/images/hdfs_architecture.png and /dev/null differ diff --git a/courses/big_data/images/map_reduce.jpg b/courses/big_data/images/map_reduce.jpg deleted file mode 100644 index 226f99d..0000000 Binary files a/courses/big_data/images/map_reduce.jpg and /dev/null differ diff --git a/courses/big_data/images/mapreduce_example.jpg b/courses/big_data/images/mapreduce_example.jpg deleted file mode 100644 index 67e82a2..0000000 Binary files a/courses/big_data/images/mapreduce_example.jpg and /dev/null differ diff --git a/courses/big_data/images/pig_example.png b/courses/big_data/images/pig_example.png deleted file mode 100644 index 2132a79..0000000 Binary files a/courses/big_data/images/pig_example.png and /dev/null differ diff --git a/courses/big_data/images/yarn_architecture.gif b/courses/big_data/images/yarn_architecture.gif deleted file mode 100644 index 3a560ce..0000000 Binary files a/courses/big_data/images/yarn_architecture.gif and /dev/null differ diff --git a/courses/big_data/intro.md b/courses/big_data/intro.md deleted file mode 100644 index 751ef33..0000000 --- a/courses/big_data/intro.md +++ /dev/null @@ -1,57 +0,0 @@ -# Big Data - -## Prerequisites - -- Basics of Linux File systems. -- Basic understanding of System Design. - -## What to expect from this course - -This course covers the basics of Big Data and how it has evolved to become what it is today. We will take a look at a few realistic scenarios where Big Data would be a perfect fit. An interesting assignment on designing a Big Data system is followed by understanding the architecture of Hadoop and the tooling around it. - -## What is not covered under this course - -Writing programs to draw analytics from data. - -## Course Contents - -1. [Overview of Big Data](https://linkedin.github.io/school-of-sre/big_data/intro/#overview-of-big-data) -2. [Usage of Big Data techniques](https://linkedin.github.io/school-of-sre/big_data/intro/#usage-of-big-data-techniques) -3. [Evolution of Hadoop](https://linkedin.github.io/school-of-sre/big_data/evolution/) -4. [Architecture of hadoop](https://linkedin.github.io/school-of-sre/big_data/evolution/#architecture-of-hadoop) - 1. HDFS - 2. Yarn -5. [MapReduce framework](https://linkedin.github.io/school-of-sre/big_data/evolution/#mapreduce-framework) -6. [Other tooling around hadoop](https://linkedin.github.io/school-of-sre/big_data/evolution/#other-tooling-around-hadoop) - 1. Hive - 2. Pig - 3. Spark - 4. Presto -7. [Data Serialisation and storage](https://linkedin.github.io/school-of-sre/big_data/evolution/#data-serialisation-and-storage) - - -# Overview of Big Data - -1. Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or a tool, rather it has become a complete subject, which involves various tools, techniques, and frameworks. -2. Big Data could consist of - 1. Structured data - 2. Unstructured data - 3. Semi-structured data -3. Characteristics of Big Data: - 1. Volume - 2. Variety - 3. Velocity - 4. Variability -4. 
Examples of Big Data generation include stock exchanges, social media sites, jet engines, etc. - - -# Usage of Big Data Techniques - -1. Take the example of the traffic lights problem. - 1. There are more than 300,000 traffic lights in the US as of 2018. - 2. Let us assume that we placed a device on each of them to collect metrics and send it to a central metrics collection system. - 3. If each of the IoT devices sends 10 events per minute, we have 300000x10x60x24 = 432x10^7 events per day. - 4. How would you go about processing that and telling me how many of the signals were “green” at 10:45 am on a particular day? -2. Consider the next example on Unified Payments Interface (UPI) transactions: - 1. We had about 1.15 billion UPI transactions in the month of October 2019 in India. - 12. If we try to extrapolate this data to about a year and try to find out some common payments that were happening through a particular UPI ID, how do you suggest we go about that? diff --git a/courses/big_data/tasks.md b/courses/big_data/tasks.md deleted file mode 100644 index fc2ec5e..0000000 --- a/courses/big_data/tasks.md +++ /dev/null @@ -1,14 +0,0 @@ -# Tasks and conclusion - -## Post-training tasks: - -1. Try setting up your own 3 node Hadoop cluster. - 1. A VM based solution can be found [here](http://hortonworks.com/wp-content/uploads/2015/04/Import_on_VBox_4_07_2015.pdf) -2. Write a simple spark/MR job of your choice and understand how to generate analytics from data. - 1. Sample dataset can be found [here](https://grouplens.org/datasets/movielens/) - -## References: -1. [Hadoop documentation](http://hadoop.apache.org/docs/current/) -2. [HDFS Architecture](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) -3. [YARN Architecture](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html) -4. [Google GFS paper](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/035fc972c796d33122033a0614bc94cff1527999.pdf) diff --git a/courses/databases_nosql/further_reading.md b/courses/databases_nosql/further_reading.md deleted file mode 100644 index f08d944..0000000 --- a/courses/databases_nosql/further_reading.md +++ /dev/null @@ -1,29 +0,0 @@ -# Conclusion - -We have covered basic concepts of NoSQL databases. There is much more to learn and do. We hope this course gives you a good start and inspires you to explore further. 
- -# Further reading - -NoSQL: - -[https://hostingdata.co.uk/nosql-database/](https://hostingdata.co.uk/nosql-database/) - -[https://www.mongodb.com/nosql-explained](https://www.mongodb.com/nosql-explained) - -[https://www.mongodb.com/nosql-explained/nosql-vs-sql](https://www.mongodb.com/nosql-explained/nosql-vs-sql) - -Cap Theorem - -[http://www.julianbrowne.com/article/brewers-cap-theorem](http://www.julianbrowne.com/article/brewers-cap-theorem) - -Scalability - -[http://www.slideshare.net/jboner/scalability-availability-stability-patterns](http://www.slideshare.net/jboner/scalability-availability-stability-patterns) - -Eventual Consistency - -[https://www.allthingsdistributed.com/2008/12/eventually_consistent.html](https://www.allthingsdistributed.com/2008/12/eventually_consistent.html) - -[https://www.toptal.com/big-data/consistent-hashing](https://www.toptal.com/big-data/consistent-hashing) - -[https://web.stanford.edu/class/cs244/papers/chord_TON_2003.pdf](https://web.stanford.edu/class/cs244/papers/chord_TON_2003.pdf) diff --git a/courses/databases_nosql/images/Quorum.png b/courses/databases_nosql/images/Quorum.png deleted file mode 100644 index 8e79ec5..0000000 Binary files a/courses/databases_nosql/images/Quorum.png and /dev/null differ diff --git a/courses/databases_nosql/images/cluster_quorum.png b/courses/databases_nosql/images/cluster_quorum.png deleted file mode 100644 index 1091fb4..0000000 Binary files a/courses/databases_nosql/images/cluster_quorum.png and /dev/null differ diff --git a/courses/databases_nosql/images/consistent_hashing.png b/courses/databases_nosql/images/consistent_hashing.png deleted file mode 100644 index 31564bc..0000000 Binary files a/courses/databases_nosql/images/consistent_hashing.png and /dev/null differ diff --git a/courses/databases_nosql/images/database_sharding.png b/courses/databases_nosql/images/database_sharding.png deleted file mode 100644 index b3f83db..0000000 Binary files a/courses/databases_nosql/images/database_sharding.png and /dev/null differ diff --git a/courses/databases_nosql/images/vector_clocks.png b/courses/databases_nosql/images/vector_clocks.png deleted file mode 100644 index c4e9361..0000000 Binary files a/courses/databases_nosql/images/vector_clocks.png and /dev/null differ diff --git a/courses/databases_nosql/intro.md b/courses/databases_nosql/intro.md deleted file mode 100644 index 9eb6291..0000000 --- a/courses/databases_nosql/intro.md +++ /dev/null @@ -1,219 +0,0 @@ -# NoSQL Concepts - -## Prerequisites -- [Relational Databases](https://linkedin.github.io/school-of-sre/databases_sql/intro/) - -## What to expect from this course - -At the end of training, you will have an understanding of what a NoSQL database is, what kind of advantages or disadvantages it has over traditional RDBMS, learn about different types of NoSQL databases and understand some of the underlying concepts & trade offs w.r.t to NoSQL. - - -## What is not covered under this course - -We will not be deep diving into any specific NoSQL Database. 
- - -## Course Contents - - - -* [Introduction to NoSQL](https://linkedin.github.io/school-of-sre/databases_nosql/intro/#introduction) -* [CAP Theorem](https://linkedin.github.io/school-of-sre/databases_nosql/key_concepts/#cap-theorem) -* [Data versioning](https://linkedin.github.io/school-of-sre/databases_nosql/key_concepts/#versioning-of-data-in-distributed-systems) -* [Partitioning](https://linkedin.github.io/school-of-sre/databases_nosql/key_concepts/#partitioning) -* [Hashing](https://linkedin.github.io/school-of-sre/databases_nosql/key_concepts/#hashing) -* [Quorum](https://linkedin.github.io/school-of-sre/databases_nosql/key_concepts/#quorum) - - -## Introduction - -When people use the term “NoSQL database”, they typically use it to refer to any non-relational database. Some say the term “NoSQL” stands for “non SQL” while others say it stands for “not only SQL.” Either way, most agree that NoSQL databases are databases that store data in a format other than relational tables. - -A common misconception is that NoSQL databases or non-relational databases don’t store relationship data well. NoSQL databases can store relationship data—they just store it differently than relational databases do. In fact, when compared with SQL databases, many find modeling relationship data in NoSQL databases to be _easier_, because related data doesn’t have to be split between tables. - -Such databases have existed since the late 1960s, but the name "NoSQL" was only coined in the early 21st century. NASA used a NoSQL database to track inventory for the Apollo mission. NoSQL databases emerged in the late 2000s as the cost of storage dramatically decreased. Gone were the days of needing to create a complex, difficult-to-manage data model simply for the purposes of reducing data duplication. Developers (rather than storage) were becoming the primary cost of software development, so NoSQL databases optimized for developer productivity. With the rise of Agile development methodology, NoSQL databases were developed with a focus on scaling, fast performance and at the same time allowed for frequent application changes and made programming easier. - - -### Types of NoSQL databases: - -Over time due to the way these NoSQL databases were developed to suit requirements at different companies, we ended up with quite a few types of them. However, they can be broadly classified into 4 types. Some of the databases can overlap between different types. They are - - - -1. **Document databases:** They store data in documents similar to [JSON](https://www.json.org/json-en.html) (JavaScript Object Notation) objects. Each document contains pairs of fields and values. The values can typically be a variety of types including things like strings, numbers, booleans, arrays, or objects, and their structures typically align with objects developers are working with in code. The advantages include intuitive data model & flexible schemas. Because of their variety of field value types and powerful query languages, document databases are great for a wide variety of use cases and can be used as a general purpose database. They can horizontally scale-out to accomodate large data volumes. Ex: MongoDB, Couchbase -2. **Key-Value databases:** These are a simpler type of databases where each item contains keys and values. A value can typically only be retrieved by referencing its key, so learning how to query for a specific key-value pair is typically simple. 
Key-value databases are great for use cases where you need to store large amounts of data but you don’t need to perform complex queries to retrieve it. Common use cases include storing user preferences or caching. Ex: [Redis](https://redis.io/), [DynamoDB](https://aws.amazon.com/dynamodb/), [Voldemort](https://www.project-voldemort.com/voldemort/)/[Venice](https://engineering.linkedin.com/blog/2017/04/building-venice--a-production-software-case-study) (Linkedin), -3. **Wide-Column stores:** They store data in tables, rows, and dynamic columns. Wide-column stores provide a lot of flexibility over relational databases because each row is not required to have the same columns. Many consider wide-column stores to be two-dimensional key-value databases. Wide-column stores are great for when you need to store large amounts of data and you can predict what your query patterns will be. Wide-column stores are commonly used for storing Internet of Things data and user profile data. [Cassandra](https://cassandra.apache.org/) and [HBase](https://hbase.apache.org/) are two of the most popular wide-column stores. -4. **Graph Databases:** These databases store data in nodes and edges. Nodes typically store information about people, places, and things while edges store information about the relationships between the nodes. The underlying storage mechanism of graph databases can vary. Some depend on a relational engine and “store” the graph data in a table (although a table is a logical element, therefore this approach imposes another level of abstraction between the graph database, the graph database management system and the physical devices where the data is actually stored). Others use a key-value store or document-oriented database for storage, making them inherently NoSQL structures. Graph databases excel in use cases where you need to traverse relationships to look for patterns such as social networks, fraud detection, and recommendation engines. Ex: [Neo4j](https://neo4j.com/) - - -### **Comparison** - - -

| | Performance | Scalability | Flexibility | Complexity | Functionality |
|---|---|---|---|---|---|
| Key-value | High | High | High | None | Variable |
| Document stores | High | Variable (high) | High | Low | Variable (low) |
| Column DB | High | High | Moderate | Low | Minimal |
| Graph | Variable | Variable | High | High | Graph theory |

| | SQL Databases | NoSQL Databases |
|---|---|---|
| Data Storage Model | Tables with fixed rows and columns | Document: JSON documents; Key-value: key-value pairs; Wide-column: tables with rows and dynamic columns; Graph: nodes and edges |
| Primary Purpose | General purpose | Document: general purpose; Key-value: large amounts of data with simple lookup queries; Wide-column: large amounts of data with predictable query patterns; Graph: analyzing and traversing relationships between connected data |
| Schemas | Rigid | Flexible |
| Scaling | Vertical (scale up with a larger server) | Horizontal (scale out across commodity servers) |
| Multi-Record ACID Transactions | Supported | Most do not support multi-record ACID transactions; some, like MongoDB, do |
| Joins | Typically required | Typically not required |
| Data to Object Mapping | Requires ORM (object-relational mapping) | Many do not require ORMs; document-DB documents map directly to data structures in most popular programming languages |

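To make the "Data to Object Mapping" and "Joins" rows above concrete, here is a small illustrative Python sketch (all field and collection names are made up) showing the same logical record as a single document versus normalized relational rows joined back together in application code:

```
# One logical "user" record as a document, the shape a document store would hold.
# It maps directly onto a native data structure, so no ORM layer is needed.
user_document = {
    "user_id": 42,
    "name": "Asha",
    "addresses": [  # related data is nested in place instead of split across tables
        {"type": "home", "city": "Bengaluru"},
        {"type": "work", "city": "Pune"},
    ],
}

# The same record in a normalized relational model is split across two tables,
# and the application (or an ORM) has to join them back together.
users_table = [{"user_id": 42, "name": "Asha"}]
addresses_table = [
    {"user_id": 42, "type": "home", "city": "Bengaluru"},
    {"user_id": 42, "type": "work", "city": "Pune"},
]

# The "join" done in application code:
reassembled = [
    {
        **user,
        "addresses": [a for a in addresses_table if a["user_id"] == user["user_id"]],
    }
    for user in users_table
]
print(reassembled[0]["addresses"])  # the two address rows re-attached to the user
```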
### CAP Theorem

| Choice | Traits | Examples |
|---|---|---|
| Consistency + Availability (forfeit Partition tolerance) | 2-phase commits; cache invalidation protocols | Single-site databases; cluster databases; LDAP; xFS file system |
| Consistency + Partition tolerance (forfeit Availability) | Pessimistic locking; making minority partitions unavailable | Distributed databases; distributed locking; majority protocols |
| Availability + Partition tolerance (forfeit Consistency) | Expirations/leases; optimistic conflict resolution | DNS; web caching |

Vector clocks illustration
Vector clocks have the following advantages over other conflict resolution mechanisms:

1. No dependency on synchronized clocks.
2. No total ordering of revision numbers is required for causal reasoning.
3. No need to store and maintain multiple versions of the data on different nodes.

### Partitioning

When the amount of data crosses the capacity of a single node, we need to think about splitting data and creating replicas for load balancing and disaster recovery. Depending on how dynamic the infrastructure is, we have a few approaches that we can take.

1. **Memory cached**

    These are partitioned in-memory databases that are primarily used for transient data. They are generally used as a front for traditional RDBMS: the most frequently used data is replicated from an RDBMS into a memory database to facilitate fast queries and to take the load off the backend DBs. Very common examples are Memcached and Couchbase.

2. **Clustering**

    Traditional cluster mechanisms abstract away the cluster topology from clients. A client need not know where the actual data resides or which node it is talking to. Clustering is very commonly used in traditional RDBMS, where it can help scale the persistent layer to a certain extent.

3. **Separating reads from writes**

    In this method, you have multiple replicas hosting the same data. Incoming writes are typically sent to a single node (leader) or multiple nodes (multi-leader), while the rest of the replicas (followers) handle read requests. The leader replicates writes asynchronously to all followers, so write lag can't be completely avoided. Sometimes a leader can crash before it replicates all the data to a follower; when this happens, the follower with the most consistent data can be promoted to leader. As you can see, it is hard to enforce full consistency in this model. You also need to consider the ratio of read to write traffic: this model doesn't make sense when writes outnumber reads. Replication methods also vary widely. Some systems do a complete transfer of state periodically, while others use a delta state transfer approach. You could also transfer the state by shipping the operations in order; the followers then apply the same operations as the leader to catch up.

4. **Sharding**

    Sharding refers to dividing data in such a way that it is distributed evenly (both in terms of storage and processing power) across a cluster of nodes. It can also imply data locality, which means similar and related data is stored together to facilitate faster access. A shard can in turn be replicated to meet load balancing or disaster recovery requirements. A single shard replica might take all writes (single leader), or multiple replicas can take writes (multi-leader). Reads can be distributed across multiple replicas. Since data is now distributed across multiple nodes, clients should be able to consistently figure out where data is hosted; we will look at some of the common techniques below. The downside of sharding is that joins between shards are not possible, so an upstream/downstream application has to aggregate the results from multiple shards.

Sharding example
- - -### Hashing - -A hash function is a function that maps one piece of data—typically describing some kind of object, often of arbitrary size—to another piece of data, typically an integer, known as _hash code_, or simply _hash_. In a partitioned database, it is important to consistently map a key to a server/replica. - -For ex: you can use a very simple hash as a modulo function. - - - _p = k mod n_ - -Where - - - p -> partition, - - - k -> primary key - - - n -> no of nodes - -The downside of this simple hash is that, whenever the cluster topology changes, the data distribution also changes. When you are dealing with memory caches, it will be easy to distribute partitions around. Whenever a node joins/leaves a topology, partitions can reorder themselves, a cache miss can be re-populated from backend DB. However when you look at persistent data, it is not possible as the new node doesn’t have the data needed to serve it. This brings us to consistent hashing. - - -#### Consistent Hashing - -Consistent hashing is a distributed hashing scheme that operates independently of the number of servers or objects in a distributed _hash table_ by assigning them a position on an abstract circle, or _hash ring_. This allows servers and objects to scale without affecting the overall system. - -Say that our hash function h() generates a 32-bit integer. Then, to determine to which server we will send a key k, we find the server s whose hash h(s) is the smallest integer that is larger than h(k). To make the process simpler, we assume the table is circular, which means that if we cannot find a server with a hash larger than h(k), we wrap around and start looking from the beginning of the array. - -- - - - - -
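Here is a tiny Python sketch of the `p = k mod n` scheme described above, using a toy hash in place of a real one, that shows how adding a single node reshuffles most keys:

```
def node_for(key: str, num_nodes: int) -> int:
    # p = k mod n, with a stable toy hash of the key standing in for k.
    # (Real systems would use a proper hash such as MD5 or MurmurHash.)
    return sum(key.encode()) % num_nodes

keys = [f"user:{i}" for i in range(10)]

before = {k: node_for(k, 3) for k in keys}   # 3-node cluster
after = {k: node_for(k, 4) for k in keys}    # one node added

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed nodes")  # typically most of them
```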
Consistent hashing illustration
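Before looking at how keys move when servers join or leave, here is a compact, hedged Python sketch of such a hash ring with virtual nodes; the server labels, the replica count, and the use of MD5 are illustrative choices rather than a prescription:

```
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Stable position on a 32-bit ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHashRing:
    def __init__(self, servers, vnodes=10):
        self.vnodes = vnodes
        self._ring = []               # sorted list of (position, server)
        for server in servers:
            self.add(server)

    def add(self, server):
        # Place several virtual nodes for each server around the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (ring_hash(f"{server}#{i}"), server))

    def remove(self, server):
        self._ring = [(pos, s) for pos, s in self._ring if s != server]

    def get(self, key):
        # Walk clockwise to the first virtual node at or after the key's
        # position, wrapping around to the start of the ring if needed.
        pos = ring_hash(key)
        idx = bisect.bisect(self._ring, (pos,)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["S1", "S2", "S3", "S4"])
before = {k: ring.get(k) for k in (f"user:{i}" for i in range(1000))}
ring.remove("S3")
after = {k: ring.get(k) for k in before}
moved = sum(1 for k in before if before[k] != after[k])
print(f"{moved} of 1000 keys moved")  # roughly only the keys that lived on S3
```

Removing S3 relocates only the keys whose positions fell on S3's virtual nodes, which is exactly the behaviour described next.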
- -In consistent hashing when a server is removed or added then only the keys from that server are relocated. For example, if server S3 is removed then, all keys from server S3 will be moved to server S4 but keys stored on server S4 and S2 are not relocated. But there is one problem, when server S3 is removed then keys from S3 are not equally distributed among remaining servers S4 and S2. They are only assigned to server S4 which increases the load on server S4. - -To evenly distribute the load among servers when a server is added or removed, it creates a fixed number of replicas ( known as virtual nodes) of each server and distributes it along the circle. So instead of server labels S1, S2 and S3, we will have S10 S11…S19, S20 S21…S29 and S30 S31…S39. The factor for a number of replicas is also known as _weight_, depending on the situation. - - - - -All keys which are mapped to replicas Sij are stored on server Si. To find a key we do the same thing, find the position of the key on the circle and then move forward until you find a server replica. If the server replica is Sij then the key is stored in server Si. - -Suppose server S3 is removed, then all S3 replicas with labels S30 S31 … S39 must be removed. Now the objects keys adjacent to S3X labels will be automatically re-assigned to S1X, S2X and S4X. All keys originally assigned to S1, S2 & S4 will not be moved. - -Similar things happen if we add a server. Suppose we want to add a server S5 as a replacement of S3 then we need to add labels S50 S51 … S59. In the ideal case, one-fourth of keys from S1, S2 and S4 will be reassigned to S5. - -When applied to persistent storages, further issues arise: if a node has left the scene, data stored on this node becomes unavailable, unless it has been replicated to other nodes before; in the opposite case of a new node joining the others, adjacent nodes are no longer responsible for some pieces of data which they still store but not get asked for anymore as the corresponding objects are no longer hashed to them by requesting clients. In order to address this issue, a replication factor (r) can be introduced. - -Introducing replicas in a partitioning scheme—besides reliability benefits—also makes it possible to spread workload for read requests that can go to any physical node responsible for a requested piece of data. Scalability doesn’t work if the clients have to decide between multiple versions of the dataset, because they need to read from a quorum of servers which in turn reduces the efficiency of load balancing. - - - - -### Quorum - -Quorum is the minimum number of nodes in a cluster that must be online and be able to communicate with each other. If any additional node failure occurs beyond this threshold, the cluster will stop running. - - - - - -To attain a quorum, you need a majority of the nodes. Commonly it is (N/2 + 1), where N is the total no of nodes in the system. For ex, - -In a 3 node cluster, you need 2 nodes for a majority, - -In a 5 node cluster, you need 3 nodes for a majority, - -In a 6 node cluster, you need 4 nodes for a majority. - -- - - - -
Quorum example
- - - -Network problems can cause communication failures among cluster nodes. One set of nodes might be able to communicate together across a functioning part of a network but not be able to communicate with a different set of nodes in another part of the network. This is known as split brain in cluster or cluster partitioning. - -Now the partition which has quorum is allowed to continue running the application. The other partitions are removed from the cluster. - -Eg: In a 5 node cluster, consider what happens if nodes 1, 2, and 3 can communicate with each other but not with nodes 4 and 5. Nodes 1, 2, and 3 constitute a majority, and they continue running as a cluster. Nodes 4 and 5, being a minority, stop running as a cluster. If node 3 loses communication with other nodes, all nodes stop running as a cluster. However, all functioning nodes will continue to listen for communication, so that when the network begins working again, the cluster can form and begin to run. - -Below diagram demonstrates Quorum selection on a cluster partitioned into two sets. - -- - - - -**
Cluster Quorum example
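The majority rule and the split-brain scenario described above reduce to a few lines of arithmetic; the following Python helper is a minimal illustration, with node counts taken from the examples in the text:

```
def quorum_size(total_nodes: int) -> int:
    # Minimum number of nodes that must stay connected: floor(N/2) + 1.
    return total_nodes // 2 + 1

def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    return reachable_nodes >= quorum_size(total_nodes)

for n in (3, 5, 6):
    print(n, "node cluster needs", quorum_size(n), "nodes for a majority")
# 3 -> 2, 5 -> 3, 6 -> 4, matching the examples above

# Split-brain scenario from the text: a 5-node cluster partitioned 3 / 2.
print(has_quorum(3, 5))   # True  -- this partition keeps running
print(has_quorum(2, 5))   # False -- this partition stops
```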
** - diff --git a/courses/databases_sql/backup_recovery.md b/courses/databases_sql/backup_recovery.md deleted file mode 100644 index 81867e4..0000000 --- a/courses/databases_sql/backup_recovery.md +++ /dev/null @@ -1,219 +0,0 @@ -### Backup and Recovery -Backups are a very crucial part of any database setup. They are generally a copy of the data that can be used to reconstruct the data in case of any major or minor crisis with the database. In general terms backups can be of two types:- - -- **Physical Backup** - the data directory as it is on the disk -- **Logical Backup** - the table structure and records in it - -Both the above kinds of backups are supported by MySQL with different tools. It is the job of the SRE to identify which should be used when. - -#### Mysqldump -This utility is available with MySQL installation. It helps in getting the logical backup of the database. It outputs a set of SQL statements to reconstruct the data. It is not recommended to use mysqldump for large tables as it might take a lot of time and the file size will be huge. However, for small tables it is the best and the quickest option. - -`mysqldump [options] > dump_output.sql` - -There are certain options that can be used with mysqldump to get an appropriate dump of the database. - -To dump all the databases - -`mysqldump -uFigure 8: An alert notification received on Slack
- -Today most of the monitoring services available provide a mechanism to -set up alerts on one or a combination of metrics to actively monitor the -service health. These alerts have a set of defined rules or conditions, -and when the rule is broken, you are notified. These rules can be as -simple as notifying when the metric value exceeds n to as complex as a -week over week (WoW) comparison of standard deviation over a period of -time. Monitoring tools notify you about an active alert, and most of -these tools support instant messaging (IM) platforms, SMS, email, or -phone calls. Figure 8 shows a sample alert notification received on -Slack for memory usage exceeding 90 percent of total RAM space on the -host. diff --git a/courses/metrics_and_monitoring/best_practices.md b/courses/metrics_and_monitoring/best_practices.md deleted file mode 100644 index 5454bde..0000000 --- a/courses/metrics_and_monitoring/best_practices.md +++ /dev/null @@ -1,40 +0,0 @@ -## - -# Best practices for monitoring - -When setting up monitoring for a service, keep the following best -practices in mind. - -- **Use the right metric type** -- Most of the libraries available - today offer various metric types. Choose the appropriate metric - type for monitoring your system. Following are the types of - metrics and their purposes. - - - **Gauge --** *Gauge* is a constant type of metric. After the - metric is initialized, the metric value does not change unless - you intentionally update it. - - - **Timer --** *Timer* measures the time taken to complete a - task. - - - **Counter --** *Counter* counts the number of occurrences of a - particular event. - - For more information about these metric types, see [Data - Types](https://statsd.readthedocs.io/en/v0.5.0/types.html). - -- **Avoid over-monitoring** -- Monitoring can be a significant - engineering endeavor***.*** Therefore, be sure not to spend too - much time and resources on monitoring services, yet make sure all - important metrics are captured. - -- **Prevent alert fatigue** -- Set alerts for metrics that are - important and actionable. If you receive too many non-critical - alerts, you might start ignoring alert notifications over time. As - a result, critical alerts might get overlooked. - -- **Have a runbook for alerts** -- For every alert, make sure you have - a document explaining what actions and checks need to be performed - when the alert fires. This enables any engineer on the team to - handle the alert and take necessary actions, without any help from - others. \ No newline at end of file diff --git a/courses/metrics_and_monitoring/command-line_tools.md b/courses/metrics_and_monitoring/command-line_tools.md deleted file mode 100644 index 987466b..0000000 --- a/courses/metrics_and_monitoring/command-line_tools.md +++ /dev/null @@ -1,101 +0,0 @@ -## - -# Command-line tools -Most of the Linux distributions today come with a set of tools that -monitor the system's performance. These tools help you measure and -understand various subsystem statistics (CPU, memory, network, and so -on). Let's look at some of the tools that are predominantly used. - -- `ps/top `-- The process status command (ps) displays information - about all the currently running processes in a Linux system. The - top command is similar to the ps command, but it periodically - updates the information displayed until the program is terminated. - An advanced version of top, called htop, has a more user-friendly - interface and some additional features. 
These command-line - utilities come with options to modify the operation and output of - the command. Following are some important options supported by the - ps command. - - - `-pFigure 2: Results of top command
- -- `ss` -- The socket statistics command (ss) displays information - about network sockets on the system. This tool is the successor of - [netstat](https://man7.org/linux/man-pages/man8/netstat.8.html), - which is deprecated. Following are some command-line options - supported by the ss command: - - - `-t` -- Displays the TCP socket. Similarly, `-u` displays UDP - sockets, `-x` is for UNIX domain sockets, and so on. - - - `-l` -- Displays only listening sockets. - - - `-n` -- Instructs the command to not resolve service names. - Instead displays the port numbers. - -Figure -3: List of listening sockets on a system
- -- `free` -- The free command displays memory usage statistics on the - host like available memory, used memory, and free memory. Most often, - this command is used with the `-h` command-line option, which - displays the statistics in a human-readable format. - - -Figure 4: Memory statistics on a host in human-readable form
- -- `df --` The df command displays disk space usage statistics. The - `-i` command-line option is also often used to display - [inode](https://en.wikipedia.org/wiki/Inode) usage - statistics. The `-h` command-line option is used for displaying - statistics in a human-readable format. - - -Figure 5: - Disk usage statistics on a system in human-readable form
- -- `sar` -- The sar utility monitors various subsystems, such as CPU - and memory, in real time. This data can be stored in a file - specified with the `-o` option. This tool helps to identify - anomalies. - -- `iftop` -- The interface top command (`iftop`) displays bandwidth - utilization by a host on an interface. This command is often used - to identify bandwidth usage by active connections. The `-i` option - specifies which network interface to watch. - - -Figure 6: Network bandwidth usage by -active connection on the host
- -- `tcpdump` -- The tcpdump command is a network monitoring tool that - captures network packets flowing over the network and displays a - description of the captured packets. The following options are - available: - - - `-iFigure 7: *tcpdump* of packets on *docker0* -interface on a host
\ No newline at end of file diff --git a/courses/metrics_and_monitoring/conclusion.md b/courses/metrics_and_monitoring/conclusion.md deleted file mode 100644 index 6d42651..0000000 --- a/courses/metrics_and_monitoring/conclusion.md +++ /dev/null @@ -1,52 +0,0 @@ -# Conclusion - -A robust monitoring and alerting system is necessary for maintaining and -troubleshooting a system. A dashboard with key metrics can give you an -overview of service performance, all in one place. Well-defined alerts -(with realistic thresholds and notifications) further enable you to -quickly identify any anomalies in the service infrastructure and in -resource saturation. By taking necessary actions, you can avoid any -service degradations and decrease MTTD for service breakdowns. - -In addition to in-house monitoring, monitoring real user experience can -help you to understand service performance as perceived by the users. -Many modules are involved in serving the user, and most of them are out -of your control. Therefore, you need to have real-user monitoring in -place. - -Metrics give very abstract details on service performance. To get a -better understanding of the system and for faster recovery during -incidents, you might want to implement the other two pillars of -observability: logs and tracing. Logs and trace data can help you -understand what led to service failure or degradation. - -Following are some resources to learn more about monitoring and -observability: - -- [Google SRE book: Monitoring Distributed - Systems](https://sre.google/sre-book/monitoring-distributed-systems/) - -- [Mastering Distributed Tracing by Yuri - Shkuro](https://learning.oreilly.com/library/view/mastering-distributed-tracing/9781788628464/) - - - -## References - -- [Google SRE book: Monitoring Distributed - Systems](https://sre.google/sre-book/monitoring-distributed-systems/) - -- [Mastering Distributed Tracing, by Yuri - Shkuro](https://learning.oreilly.com/library/view/mastering-distributed-tracing/9781788628464/) - -- [Monitoring and - Observability](https://copyconstruct.medium.com/monitoring-and-observability-8417d1952e1c) - -- [Three PIllars with Zero - Answers](https://medium.com/lightstephq/three-pillars-with-zero-answers-2a98b36358b8) - -- Engineering blogs on - [LinkedIn](https://engineering.linkedin.com/blog/topic/monitoring), - [Grafana](https://grafana.com/blog/), - [Elastic.co](https://www.elastic.co/blog/), - [OpenTelemetry](https://medium.com/opentelemetry) diff --git a/courses/metrics_and_monitoring/images/image1.jpg b/courses/metrics_and_monitoring/images/image1.jpg deleted file mode 100644 index 776248f..0000000 Binary files a/courses/metrics_and_monitoring/images/image1.jpg and /dev/null differ diff --git a/courses/metrics_and_monitoring/images/image10.png b/courses/metrics_and_monitoring/images/image10.png deleted file mode 100644 index 2bae97a..0000000 Binary files a/courses/metrics_and_monitoring/images/image10.png and /dev/null differ diff --git a/courses/metrics_and_monitoring/images/image11.png b/courses/metrics_and_monitoring/images/image11.png deleted file mode 100644 index 41bf3d3..0000000 Binary files a/courses/metrics_and_monitoring/images/image11.png and /dev/null differ diff --git a/courses/metrics_and_monitoring/images/image12.png b/courses/metrics_and_monitoring/images/image12.png deleted file mode 100644 index 1588af3..0000000 Binary files a/courses/metrics_and_monitoring/images/image12.png and /dev/null differ diff --git a/courses/metrics_and_monitoring/images/image2.png 
b/courses/metrics_and_monitoring/images/image2.png deleted file mode 100644 index c6cee36..0000000 Binary files a/courses/metrics_and_monitoring/images/image2.png and /dev/null differ diff --git a/courses/metrics_and_monitoring/images/image3.jpg b/courses/metrics_and_monitoring/images/image3.jpg deleted file mode 100644 index 26c68e9..0000000 Binary files a/courses/metrics_and_monitoring/images/image3.jpg and /dev/null differ diff --git a/courses/metrics_and_monitoring/images/image4.jpg b/courses/metrics_and_monitoring/images/image4.jpg deleted file mode 100644 index c3266d4..0000000 Binary files a/courses/metrics_and_monitoring/images/image4.jpg and /dev/null differ diff --git a/courses/metrics_and_monitoring/images/image5.jpg b/courses/metrics_and_monitoring/images/image5.jpg deleted file mode 100644 index 95abeaf..0000000 Binary files a/courses/metrics_and_monitoring/images/image5.jpg and /dev/null differ diff --git a/courses/metrics_and_monitoring/images/image6.png b/courses/metrics_and_monitoring/images/image6.png deleted file mode 100644 index 70115ae..0000000 Binary files a/courses/metrics_and_monitoring/images/image6.png and /dev/null differ diff --git a/courses/metrics_and_monitoring/images/image7.png b/courses/metrics_and_monitoring/images/image7.png deleted file mode 100644 index 55adbe6..0000000 Binary files a/courses/metrics_and_monitoring/images/image7.png and /dev/null differ diff --git a/courses/metrics_and_monitoring/images/image8.png b/courses/metrics_and_monitoring/images/image8.png deleted file mode 100644 index 67fab10..0000000 Binary files a/courses/metrics_and_monitoring/images/image8.png and /dev/null differ diff --git a/courses/metrics_and_monitoring/images/image9.png b/courses/metrics_and_monitoring/images/image9.png deleted file mode 100644 index 9a71bf0..0000000 Binary files a/courses/metrics_and_monitoring/images/image9.png and /dev/null differ diff --git a/courses/metrics_and_monitoring/introduction.md b/courses/metrics_and_monitoring/introduction.md deleted file mode 100644 index 01d3491..0000000 --- a/courses/metrics_and_monitoring/introduction.md +++ /dev/null @@ -1,281 +0,0 @@ -## - -# Prerequisites - -- [Linux Basics](https://linkedin.github.io/school-of-sre/linux_basics/intro/) - -- [Python and the Web](https://linkedin.github.io/school-of-sre/python_web/intro/) - -- [Systems Design](https://linkedin.github.io/school-of-sre/systems_design/intro/) - -- [Linux Networking Fundamentals](https://linkedin.github.io/school-of-sre/linux_networking/intro/) - - -## What to expect from this course - -Monitoring is an integral part of any system. As an SRE, you need to -have a basic understanding of monitoring a service infrastructure. By -the end of this course, you will gain a better understanding of the -following topics: - -- What is monitoring? - - - What needs to be measured - - - How the metrics gathered can be used to improve business decisions and overall reliability - - - Proactive monitoring with alerts - - - Log processing and its importance - -- What is observability? 
- - - Distributed tracing - - - Logs - - - Metrics - -## What is not covered in this course - -- Guide to setting up a monitoring infrastructure - -- Deep dive into different monitoring technologies and benchmarking or comparison of any tools - - -## Course content - -- [Introduction](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/introduction/#introduction) - - - [Four golden signals of monitoring](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/introduction/#four-golden-signals-of-monitoring) - - - [Why is monitoring important?](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/introduction/#why-is-monitoring-important) - -- [Command-line tools](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/command-line_tools/) - -- [Third-party monitoring](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/third-party_monitoring/) - -- [Proactive monitoring using alerts](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/alerts/) - -- [Best practices for monitoring](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/best_practices/) - -- [Observability](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/observability/) - - - [Logs](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/observability/#logs) - - [Tracing](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/bservability/#tracing) - -[Conclusion](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/conclusion/) - - -## - -# Introduction - -Monitoring is a process of collecting real-time performance metrics from -a system, analyzing the data to derive meaningful information, and -displaying the data to the users. In simple terms, you measure various -metrics regularly to understand the state of the system, including but -not limited to, user requests, latency, and error rate. *What gets -measured, gets fixed*---if you can measure something, you can reason -about it, understand it, discuss it, and act upon it with confidence. - - -## Four golden signals of monitoring - -When setting up monitoring for a system, you need to decide what to -measure. The four golden signals of monitoring provide a good -understanding of service performance and lay a foundation for monitoring -a system. These four golden signals are - -- Traffic - -- Latency - -- Error - -- Saturation - -These metrics help you to understand the system performance and -bottlenecks, and to create a better end-user experience. As discussed in -the [Google SRE -book](https://sre.google/sre-book/monitoring-distributed-systems/), -if you can measure only four metrics of your service, focus on these -four. Let's look at each of the four golden signals. - -- **Traffic** -- *Traffic* gives a better understanding of the service - demand. Often referred to as *service QPS* (queries per second), - traffic is a measure of requests served by the service. This - signal helps you to decide when a service needs to be scaled up to - handle increasing customer demand and scaled down to be - cost-effective. - -- **Latency** -- *Latency* is the measure of time taken by the service - to process the incoming request and send the response. Measuring - service latency helps in the early detection of slow degradation - of the service. Distinguishing between the latency of successful - requests and the latency of failed requests is important. 
For - example, an [HTTP 5XX - error](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses) - triggered due to loss of connection to a database or other - critical backend might be served very quickly. However, because an - HTTP 500 error indicates a failed request, factoring 500s into - overall latency might result in misleading calculations. - -- **Error (rate)** -- *Error* is the measure of failed client - requests. These failures can be easily identified based on the - response codes ([HTTP 5XX - error](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses)). - There might be cases where the response is considered erroneous - due to wrong result data or due to policy violations. For example, - you might get an [HTTP - 200](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/200) - response, but the body has incomplete data, or response time is - breaching the agreed-upon - [SLA](https://en.wikipedia.org/wiki/Service-level_agreement)s. - Therefore, you need to have other mechanisms (code logic or - [instrumentation](https://en.wikipedia.org/wiki/Instrumentation_(computer_programming))) - in place to capture errors in addition to the response codes. - -- **Saturation** -- *Saturation* is a measure of the resource - utilization by a service. This signal tells you the state of - service resources and how full they are. These resources include - memory, compute, network I/O, and so on. Service performance - slowly degrades even before resource utilization is at 100 - percent. Therefore, having a utilization target is important. An - increase in latency is a good indicator of saturation; measuring - the [99th - percentile](https://medium.com/@ankur_anand/an-in-depth-introduction-to-99-percentile-for-programmers-22e83a00caf) - of latency can help in the early detection of saturation. - -Depending on the type of service, you can measure these signals in -different ways. For example, you might measure queries per second served -for a web server. In contrast, for a database server, transactions -performed and database sessions created give you an idea about the -traffic handled by the database server. With the help of additional code -logic (monitoring libraries and instrumentation), you can measure these -signals periodically and store them for future analysis. Although these -metrics give you an idea about the performance at the service end, you -need to also ensure that the same user experience is delivered at the -client end. Therefore, you might need to monitor the service from -outside the service infrastructure, which is discussed under third-party -monitoring. - -## Why is monitoring important? - -Monitoring plays a key role in the success of a service. As discussed -earlier, monitoring provides performance insights for understanding -service health. With access to historical data collected over time, you -can build intelligent applications to address specific needs. Some of -the key use cases follow: - -- **Reduction in time to resolve issues** -- With a good monitoring - infrastructure in place, you can identify issues quickly and - resolve them, which reduces the impact caused by the issues. - -- **Business decisions** -- Data collected over a period of time can - help you make business decisions such as determining the product - release cycle, which features to invest in, and geographical areas - to focus on. Decisions based on long-term data can improve the - overall product experience. 
- -- **Resource planning** -- By analyzing historical data, you can - forecast service compute-resource demands, and you can properly - allocate resources. This allows financially effective decisions, - with no compromise in end-user experience. - -Before we dive deeper into monitoring, let's understand some basic -terminologies. - -- **Metric** -- A metric is a quantitative measure of a particular - system attribute---for example, memory or CPU - -- **Node or host** -- A physical server, virtual machine, or container - where an application is running - -- **QPS** -- *Queries Per Second*, a measure of traffic served by the - service per second - -- **Latency** -- The time interval between user action and the - response from the server---for example, time spent after sending a - query to a database before the first response bit is received - -- **Error** **rate** -- Number of errors observed over a particular - time period (usually a second) - -- **Graph** -- In monitoring, a graph is a representation of one or - more values of metrics collected over time - -- **Dashboard** -- A dashboard is a collection of graphs that provide - an overview of system health - -- **Incident** -- An incident is an event that disrupts the normal - operations of a system - -- **MTTD** -- *Mean Time To Detect* is the time interval between the - beginning of a service failure and the detection of such failure - -- **MTTR** -- Mean Time To Resolve is the time spent to fix a service - failure and bring the service back to its normal state - -Before we discuss monitoring an application, let us look at the -monitoring infrastructure. Following is an illustration of a basic -monitoring system. - - -Figure 1: Illustration of a monitoring infrastructure
- -Figure 1 shows a monitoring infrastructure mechanism for aggregating -metrics on the system, and collecting and storing the data for display. -In addition, a monitoring infrastructure includes alert subsystems for -notifying concerned parties during any abnormal behavior. Let's look at -each of these infrastructure components: - -- **Host metrics agent --** A *host metrics agent* is a process - running on the host that collects performance statistics for host - subsystems such as memory, CPU, and network. These metrics are - regularly relayed to a metrics collector for storage and - visualization. Some examples are - [collectd](https://collectd.org/), - [telegraf](https://www.influxdata.com/time-series-platform/telegraf/), - and [metricbeat](https://www.elastic.co/beats/metricbeat). - -- **Metric aggregator --** A *metric aggregator* is a process running - on the host. Applications running on the host collect service - metrics using - [instrumentation](https://en.wikipedia.org/wiki/Instrumentation_(computer_programming)). - Collected metrics are sent either to the aggregator process or - directly to the metrics collector over API, if available. Received - metrics are aggregated periodically and relayed to the metrics - collector in batches. An example is - [StatsD](https://github.com/statsd/statsd). - -- **Metrics collector --** A *metrics collector* process collects all - the metrics from the metric aggregators running on multiple hosts. - The collector takes care of decoding and stores this data on the - database. Metric collection and storage might be taken care of by - one single service such as - [InfluxDB](https://www.influxdata.com/), which we discuss - next. An example is [carbon - daemons](https://graphite.readthedocs.io/en/latest/carbon-daemons.html). - -- **Storage --** A time-series database stores all of these metrics. - Examples are [OpenTSDB](http://opentsdb.net/), - [Whisper](https://graphite.readthedocs.io/en/stable/whisper.html), - and [InfluxDB](https://www.influxdata.com/). - -- **Metrics server --** A *metrics server* can be as basic as a web - server that graphically renders metric data. In addition, the - metrics server provides aggregation functionalities and APIs for - fetching metric data programmatically. Some examples are - [Grafana](https://github.com/grafana/grafana) and - [Graphite-Web](https://github.com/graphite-project/graphite-web). - -- **Alert manager --** The *alert manager* regularly polls metric data - available and, if there are any anomalies detected, notifies you. - Each alert has a set of rules for identifying such anomalies. - Today many metrics servers such as - [Grafana](https://github.com/grafana/grafana) support alert - management. We discuss alerting [in detail - later](#proactive-monitoring-using-alerts). Examples are - [Grafana](https://github.com/grafana/grafana) and - [Icinga](https://icinga.com/). diff --git a/courses/metrics_and_monitoring/observability.md b/courses/metrics_and_monitoring/observability.md deleted file mode 100644 index cdec5a4..0000000 --- a/courses/metrics_and_monitoring/observability.md +++ /dev/null @@ -1,151 +0,0 @@ -## - -# Observability - -Engineers often use observability when referring to building reliable -systems. *Observability* is a term derived from control theory, It is a -measure of how well internal states of a system can be inferred from -knowledge of its external outputs. 
diff --git a/courses/metrics_and_monitoring/observability.md b/courses/metrics_and_monitoring/observability.md
deleted file mode 100644
index cdec5a4..0000000
--- a/courses/metrics_and_monitoring/observability.md
+++ /dev/null
@@ -1,151 +0,0 @@
# Observability

Engineers often use the term observability when referring to building reliable systems. *Observability* is a term derived from control theory; it is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. Service infrastructures used on a daily basis are becoming more and more complex; proactive monitoring alone is not sufficient to quickly resolve issues causing application failures. With monitoring, you can keep known past failures from recurring, but with a complex service architecture, many unknown factors can cause potential problems. To address such cases, you can make the service observable. An observable system provides highly granular insights into its implicit failure modes. In addition, an observable system furnishes ample context about its inner workings, which unlocks the ability to uncover deeper systemic issues.

Monitoring enables failure detection; observability helps in gaining a better understanding of the system. Among engineers, there is a common misconception that monitoring and observability are two different things. Actually, observability is a superset of monitoring; that is, monitoring improves service observability. The goal of observability is not only to detect problems, but also to understand where the issue is and what is causing it. In addition to metrics, observability has two more pillars: logs and traces, as shown in Figure 9. Although these three components do not make a system 100 percent observable, they are the most important and powerful components for gaining a better understanding of the system. Each of these pillars has its flaws, which are described in [Three Pillars with Zero Answers](https://medium.com/lightstephq/three-pillars-with-zero-answers-2a98b36358b8).

Figure 9: Three pillars of observability
Because we have covered metrics already, let's look at the other two pillars (logs and traces).

#### Logs

Logs (often referred to as *events*) are a record of activities performed by a service during its run time, with a corresponding timestamp. Metrics give abstract information about degradations in a system, and logs give a detailed view of what is causing these degradations. Logs created by the applications and infrastructure components help in effectively understanding the behavior of the system by providing details on application errors, exceptions, and event timelines. Logs help you to go back in time to understand the events that led to a failure. Therefore, examining logs is essential to troubleshooting system failures.

Log processing involves the aggregation of different logs from individual applications and their subsequent shipment to central storage. Moving logs to central storage helps to preserve them in case the application instances are inaccessible or the application crashes due to a failure. After the logs are available in a central place, you can analyze them to derive sensible information. For audit and compliance purposes, you archive these logs on the central storage for a certain period of time. Log analyzers extract useful information from log lines, such as the requesting user, the request URL (feature), response headers (such as content length), and response time. This information is grouped based on these attributes and made available to you through a visualization tool for quick understanding.

You might be wondering how this log information helps. It gives a holistic view of activities performed on all the involved entities. For example, let's say someone is performing a DoS (denial of service) attack on a web application. With the help of log processing, you can quickly look at the top client IPs derived from access logs and identify where the attack is coming from (a small sketch of this kind of aggregation follows below).

Similarly, if a feature in an application is causing a high error rate when accessed with a particular request parameter value, the results of log analysis can help you to quickly identify the misbehaving parameter value and take further action.
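The following sketch shows the kind of aggregation a log analyzer performs for the DoS example above: it counts requests per client IP from an access log and prints the top talkers. The log path and the space-separated format with the client IP as the first field are assumptions for illustration:

```python
from collections import Counter

LOG_PATH = "/var/log/myapp/access.log"  # assumed log location

def top_client_ips(path: str, n: int = 5) -> list[tuple[str, int]]:
    """Count requests per client IP, assuming the IP is the first space-separated field."""
    counts = Counter()
    with open(path) as log_file:
        for line in log_file:
            fields = line.split()
            if fields:                      # skip empty lines
                counts[fields[0]] += 1
    return counts.most_common(n)

for ip, hits in top_client_ips(LOG_PATH):
    print(f"{ip}\t{hits}")
```

At scale, this kind of grouping is typically done by the log analysis layer itself (for example, aggregations over indexed fields in Elasticsearch) rather than by ad-hoc scripts.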
Figure 10: Log processing and analysis using ELK stack

Figure 10 shows a log processing platform using ELK (Elasticsearch, Logstash, Kibana), which provides centralized log processing. Beats is a collection of lightweight data shippers that can ship logs, audit data, network data, and so on over the network. In this use case specifically, we are using Filebeat as a log shipper. Filebeat watches service log files and ships the log data to Logstash. Logstash parses these logs and transforms the data, preparing it to be stored in Elasticsearch. The transformed log data is stored in Elasticsearch and indexed for fast retrieval. Kibana searches and displays the log data stored in Elasticsearch. Kibana also provides a set of visualizations for graphically displaying summaries derived from log data.

Storing logs is expensive, and extensive logging of every event on the server is costly and takes up more storage space. With an increasing number of services, this cost can increase proportionally to the number of services.

#### Tracing

So far, we covered the importance of metrics and logging. Metrics give an abstract overview of the system, and logging gives a record of events that occurred. Imagine a complex distributed system with multiple microservices, where a user request is processed by multiple microservices in the system. Metrics and logging give you some information about how these requests are being handled by the system, but they fail to provide detailed information across all the microservices and how they affect a particular client request. If a slow downstream microservice is leading to increased response times, you need detailed visibility across all involved microservices to identify the offending one. The answer to this need is a request tracing mechanism.

A trace is a series of spans, where each span is a record of events performed by different microservices to serve the client's request. In simple terms, a trace is a record of how a client request was served, assembled from the various microservices it touched across different physical machines. Each span includes span metadata such as the trace ID and span ID, and context, which includes information about transactions performed.

Figure 11: Trace and spans for a URL shortener request
Figure 11 is a graphical representation of a trace captured on the [URL shortener](https://linkedin.github.io/school-of-sre/python_web/url-shorten-app/) example we covered earlier while learning Python.

Similar to monitoring, the tracing infrastructure comprises a few modules for collecting traces, storing them, and accessing them. Each microservice runs a tracing library that collects traces in the background, creates in-memory batches, and submits them to the tracing backend. The tracing backend normalizes the received trace data and stores it on persistent storage. Tracing data comes from multiple different microservices; therefore, trace storage is often organized to store data incrementally and is indexed by trace identifier. This organization helps in the reconstruction of trace data and in visualization. Figure 12 illustrates the anatomy of distributed tracing.

Figure 12: Anatomy of distributed tracing
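To make the anatomy described above concrete, here is a minimal, dependency-free sketch of how spans belonging to one trace might be represented and propagated between two services. Real systems delegate this to a tracing library (such as the tools listed below); the field names and span names here are illustrative assumptions:

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    trace_id: str                 # shared by every span in the same trace
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: str | None = None  # links a child span to its parent span
    name: str = ""
    start: float = field(default_factory=time.time)
    end: float = 0.0

    def finish(self) -> None:
        self.end = time.time()

# Front-end service: starts the trace with a root span.
trace_id = uuid.uuid4().hex
root = Span(trace_id=trace_id, name="GET /shorten")

# Downstream service: creates a child span carrying the same trace ID.
child = Span(trace_id=trace_id, parent_id=root.span_id, name="db.insert_short_url")
child.finish()
root.finish()

# Both spans would be submitted to the tracing backend and re-assembled by trace_id.
for span in (root, child):
    print(span.trace_id, span.span_id, span.parent_id, span.name)
```

In a real deployment, the tracing library also propagates the trace ID across service boundaries (for example, in request headers) so that downstream spans join the same trace.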
Today a set of tools and frameworks are available for building distributed tracing solutions. Following are some of the popular tools:

- [OpenTelemetry](https://opentelemetry.io/): Observability framework for cloud-native software

- [Jaeger](https://www.jaegertracing.io/): Open-source distributed tracing solution

- [Zipkin](https://zipkin.io/): Open-source distributed tracing solution

diff --git a/courses/metrics_and_monitoring/third-party_monitoring.md b/courses/metrics_and_monitoring/third-party_monitoring.md
deleted file mode 100644
index e968caf..0000000
--- a/courses/metrics_and_monitoring/third-party_monitoring.md
+++ /dev/null
@@ -1,37 +0,0 @@
# Third-party monitoring

Today most cloud providers offer a variety of monitoring solutions. In addition, a number of companies such as [Datadog](https://www.datadoghq.com/) offer monitoring-as-a-service. In this section, we are not covering monitoring-as-a-service in depth.

In recent years, more and more people have gained access to the internet. Many services are offered online to cater to the increasing user base. As a result, web pages are becoming larger, with increased client-side scripts. Users want these services to be fast and error-free. From the service point of view, when the response body is composed, an HTTP 200 OK response is sent, and everything looks okay. But there might be errors during transmission or on the client side. As previously mentioned, monitoring services from within the service infrastructure gives good visibility into service health, but this is not enough. You need to monitor the user experience, specifically the availability of services for clients. A number of third-party services, such as [Catchpoint](https://www.catchpoint.com/) and [Pingdom](https://www.pingdom.com/), are available for achieving this goal.

Third-party monitoring services can generate synthetic traffic simulating user requests from various parts of the world, to ensure the service is globally accessible (a minimal sketch of such a probe follows below). Other third-party monitoring solutions for real user monitoring (RUM) provide performance statistics such as service uptime and response time, from different geographical locations. This allows you to monitor the user experience from these locations, which might have different internet backbones, different operating systems, and different browsers and browser versions. [Catchpoint Global Monitoring Network](https://pages.catchpoint.com/overview-video) is a comprehensive 3-minute video that explains the importance of monitoring the client experience.
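The essence of a synthetic probe is simply issuing scripted requests against a public endpoint and recording availability and latency; third-party providers run such probes from many global vantage points. The URL below is an illustrative assumption:

```python
import time
import urllib.request

PROBE_URL = "https://example.com/health"  # assumed public health-check endpoint

def probe(url: str, timeout: float = 5.0) -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except Exception as exc:  # DNS failures, timeouts, TLS errors, and so on
        print(f"DOWN url={url} error={exc!r}")
        return
    latency_ms = (time.monotonic() - start) * 1000
    print(f"UP url={url} status={status} latency_ms={latency_ms:.1f}")

probe(PROBE_URL)
```

A real synthetic-monitoring product runs probes like this on a schedule from many geographic locations and alerts when availability or latency degrades.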
diff --git a/courses/python_web/intro.md b/courses/python_web/intro.md
deleted file mode 100644
index db1abf3..0000000
--- a/courses/python_web/intro.md
+++ /dev/null
@@ -1,112 +0,0 @@
# Python and The Web

## Prerequisites

- Basic understanding of the Python language.
- Basic familiarity with the Flask framework.

## What to expect from this course

This course is divided into two high-level parts. In the first part, assuming familiarity with the Python language's basic operations and syntax, we will dive a little deeper into understanding Python as a language. We will compare Python with other programming languages that you might already know, such as Java and C. We will also explore the concept of Python objects and, with the help of that, explore Python features like decorators.

In the second part, which revolves around the web and also assumes familiarity with the Flask framework, we will start from the socket module and work with HTTP requests. This will demystify how frameworks like Flask work internally.

To add an SRE flavour to the course, we will design, develop, and deploy (in theory) a URL shortening application. We will emphasize the parts of the whole process that are more important as an SRE of the said app/service.

## What is not covered under this course

Extensive knowledge of Python internals and advanced Python.

## Lab Environment Setup

Have the latest version of Python installed.

## Course Contents

1. [The Python Language](https://linkedin.github.io/school-of-sre/python_web/intro/#the-python-language)
    1. [Some Python Concepts](https://linkedin.github.io/school-of-sre/python_web/python-concepts/)
    2. [Python Gotchas](https://linkedin.github.io/school-of-sre/python_web/python-concepts/#some-gotchas)
2. [Python and Web](https://linkedin.github.io/school-of-sre/python_web/python-web-flask/)
    1. [Sockets](https://linkedin.github.io/school-of-sre/python_web/python-web-flask/#sockets)
    2. [Flask](https://linkedin.github.io/school-of-sre/python_web/python-web-flask/#flask)
3. [The URL Shortening App](https://linkedin.github.io/school-of-sre/python_web/url-shorten-app/)
    1. [Design](https://linkedin.github.io/school-of-sre/python_web/url-shorten-app/#design)
    2. [Scaling The App](https://linkedin.github.io/school-of-sre/python_web/sre-conclusion/#scaling-the-app)
    3. [Monitoring The App](https://linkedin.github.io/school-of-sre/python_web/sre-conclusion/#monitoring-strategy)

## The Python Language

Assuming you know a little bit of C/C++ and Java, let's try to discuss the following questions in the context of those languages and Python. You might have heard that C/C++ is a compiled language while Python is an interpreted language. Generally, with a compiled language we first compile the program and then run the executable, while in the case of Python we run the source code directly, as in `python hello_world.py`. Java, while also ultimately interpreted (by the JVM), still has a separate compilation step before it is run. So what's really the difference?

### Compiled vs. Interpreted

This might sound a little weird to you: Python, in a way, is a compiled language! Python has a compiler built in! It is obvious in the case of Java since we compile it using a separate command, i.e., `javac helloWorld.java`, which produces a `.class` file that we know as _bytecode_. Well, Python is very similar to that. One difference here is that there is no separate compile command/binary needed to run a Python program.

**What is the difference, then, between Java and Python?**
Well, Java's compiler is stricter and more sophisticated. As you might know, Java is a statically typed language, so the compiler is written in a way that it can verify type-related errors at compile time. Python, being a _dynamic_ language, does not know the types until the program is run. So in a way, the Python compiler is dumb (or, less strict). But there indeed is a compile step involved when a Python program is run. You might have seen Python bytecode files with the `.pyc` extension. Here is how you can see the bytecode for a given Python program.
```bash
# Create a Hello World
$ echo "print('hello world')" > hello_world.py

# Making sure it runs
$ python3 hello_world.py
hello world

# The bytecode of the given program
$ python3 -m dis hello_world.py
  1           0 LOAD_NAME                0 (print)
              2 LOAD_CONST               0 ('hello world')
              4 CALL_FUNCTION            1
              6 POP_TOP
              8 LOAD_CONST               1 (None)
             10 RETURN_VALUE
```

Read more about the `dis` module [here](https://docs.python.org/3/library/dis.html).

Now coming to C/C++, there of course is a compiler, but the output is different from what the Java/Python compiler would produce. Compiling a C program produces what we know as _machine code_, as opposed to bytecode.

### Running The Programs

We know compilation is involved in all three languages we are discussing; it is just that the compilers differ in nature and output different types of content. In the case of C/C++, the output is machine code, which can be read directly by your operating system. When you execute that program, your OS knows exactly how to run it. **But this is not the case with bytecode.**

Bytecode is language specific. Python has its own set of bytecode instructions (more in the `dis` module) and so does Java. So naturally, your operating system will not know how to run it. To run this bytecode, we have something called virtual machines, e.g., the JVM or the Python VM (CPython, Jython). These so-called virtual machines are programs that can read the bytecode and run it on a given operating system. Python has multiple VMs available: CPython is a Python VM implemented in the C language; similarly, Jython is a Java implementation of the Python VM. **At the end of the day, what they should be capable of is understanding Python language syntax, compiling it to bytecode, and running that bytecode.** You can implement a Python VM in any language! (And people do so, just because it can be done.)

```
hello_world.py           Python bytecode          Python VM Process (inside the OS)
+----------------+       +----------------+       +--------------------------------+
| print(...      |COMPILE| LOAD_CONST...  |       | Reads the bytecode line by     |
|                +------>+                +------>+ line and executes it.          |
|                |       |                |       |                                |
+----------------+       +----------------+       +--------------------------------+

hello_world.c            OS-specific machine code A new process (inside the OS)
+----------------+       +----------------+       +--------------------------------+
| void main() {  |COMPILE| binary         |       | binary contents                |
|                +------>+ contents       +------>+ (run as is by the OS)          |
|                |       |                |       |                                |
+----------------+       +----------------+       +--------------------------------+
```

Two things to note about the above diagram:

1. Generally, when we run a Python program, a Python VM process is started which reads the Python source code, compiles it to bytecode, and runs it in a single step. Compiling is not a separate step; it is shown separately only for illustration purposes.
2. Binaries generated for C-like languages are not run _exactly_ as is. Since there are multiple binary formats (e.g., ELF), more complicated steps are involved in order to run a binary, but we will not go into them since all of that is done at the OS level.
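The compile-then-execute flow above can also be observed from inside Python itself, using only the standard library: `compile()` produces a code object (the container for bytecode), `dis` disassembles it, and `exec()` hands it to the VM. A small sketch:

```python
import dis

source = "print('hello world')"

# Step 1: compile the source into a code object (this is what gets cached in .pyc files).
code_obj = compile(source, "<example>", "exec")

# Step 2: inspect the bytecode instructions the VM will execute.
dis.dis(code_obj)

# Step 3: let the Python VM execute the code object.
exec(code_obj)
```

Running `exec()` here is the same job the "Python VM Process" box in the diagram performs for every code object it encounters.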
diff --git a/courses/python_web/python-concepts.md b/courses/python_web/python-concepts.md
deleted file mode 100644
index bceaf38..0000000
--- a/courses/python_web/python-concepts.md
+++ /dev/null
@@ -1,162 +0,0 @@
# Some Python Concepts

Though you are expected to know Python and its syntax at a basic level, let us discuss some fundamental concepts that will help you understand the Python language better.

**Everything in Python is an object.**

That includes functions, lists, dicts, classes, modules, a running function (an instance of a function definition), everything. In CPython, this means there is an underlying struct for each object.

In Python's current execution context, all the variables are stored in a dict; it is a string-to-object mapping. If you have a function and a float variable defined in the current context, here is how they are handled internally.

```python
>>> float_number = 42.0
>>> def foo_func():
...     pass
...

# NOTICE HOW VARIABLE NAMES ARE STRINGS, stored in a dict
>>> locals()
{'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <class '_frozen_importlib.BuiltinImporter'>, '__spec__': None, '__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>, 'float_number': 42.0, 'foo_func': <function foo_func at 0x...>}
```
-"""
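To see that functions and floats really are ordinary objects, you can inspect them and even attach attributes to them. A small illustrative sketch (the variable names are arbitrary):

```python
>>> float_number = 42.0
>>> def foo_func():
...     pass
...
>>> isinstance(float_number, object), isinstance(foo_func, object)
(True, True)
>>> type(foo_func)            # the function itself is an instance of the 'function' type
<class 'function'>
>>> foo_func.added_later = "functions are objects, so they can carry attributes"
>>> foo_func.added_later
'functions are objects, so they can carry attributes'
```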
diff --git a/courses/security/conclusion.md b/courses/security/conclusion.md
deleted file mode 100644
index 8eeceea..0000000
--- a/courses/security/conclusion.md
+++ /dev/null
@@ -1,25 +0,0 @@
-# Conclusion
-
-Now that you have completed this course on Security, you are aware of the possible security threats to computer systems and networks. Not only that, but you are also better able to protect your systems and to recommend security measures to others.
-
-This course provides fundamental, everyday knowledge of the security domain, which will also help you keep security at the top of your priority list.
-
-## Other Resources
-
-Some books that would be a great resource:
-
-- Holistic Info-Sec for Web Developers