diff --git a/CODE_OF_CONDUCT/index.html b/CODE_OF_CONDUCT/index.html index 5b2d6dd..fe28caf 100644 --- a/CODE_OF_CONDUCT/index.html +++ b/CODE_OF_CONDUCT/index.html @@ -2177,7 +2177,7 @@

Note: Some LinkedIn-managed communities have codes of conduct that pre-date this document and issue resolution process. While communities are not required to change their code, they are expected to use the resolution process outlined here. The review team will coordinate with the communities involved to address your concerns.

Reporting Code of Conduct Issues

We encourage all communities to resolve issues on their own whenever possible. This builds a broader and deeper understanding and ultimately a healthier interaction. In the event that an issue cannot be resolved locally, please feel free to report your concerns by contacting oss@linkedin.com.

-

In your report please include:

+

In your report, please include:

A query language to interact with and manage data.

-

CRUD operations - create, read, update, delete queries

-

Management operations - create DBs/tables/indexes etc, backup, import/export, users, access controls

-

Exercise: Classify the below queries into the four types - DDL (definition), DML(manipulation), DCL(control) and TCL(transactions) and explain in detail.

+

CRUD operations—create, read, update, delete queries

+

Management operations—create DBs/tables/indexes, backup, import/export, users, access controls, etc.

+

Exercise: Classify the below queries into the four types—DDL (definition), DML (manipulation), DCL (control) and TCL (transactions) and explain in detail.

insert, create, drop, delete, update, commit, rollback, truncate, alter, grant, revoke
 

You can practice these in the lab section.
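For concreteness, the four CRUD operations map to SQL statements roughly like this (a minimal sketch assuming the MySQL client from the lab and a hypothetical testdb.employees table):

# Hypothetical database/table names, shown via the mysql client
mysql -e "INSERT INTO testdb.employees (id, name) VALUES (1, 'Asha');"  # Create
mysql -e "SELECT * FROM testdb.employees WHERE id = 1;"                 # Read
mysql -e "UPDATE testdb.employees SET name = 'Asha R' WHERE id = 1;"    # Update
mysql -e "DELETE FROM testdb.employees WHERE id = 1;"                   # Delete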

@@ -2188,19 +2188,19 @@
  • Constraints

Rules for the data that can be stored. A query fails if it violates any of the constraints defined on a table.

    -

    Primary key: one or more columns that contain UNIQUE values, and cannot contain NULL values. A table can have only ONE primary key. An index on it is created by default.

    -

    Foreign key: links two tables together. Its value(s) match a primary key in a different table \ -Not null: Does not allow null values \ -Unique: Value of column must be unique across all rows \ -Default: Provides a default value for a column if none is specified during insert

    -

    Check: Allows only particular values (like Balance >= 0)

    +

    Primary key: One or more columns that contain UNIQUE values, and cannot contain NULL values. A table can have only ONE primary key. An index on it is created by default.

    +

Foreign key: Links two tables together. Its value(s) match a primary key in a different table.

    +

    Not null: Does not allow null values

    +

    Unique: Value of column must be unique across all rows

    +

    Default: Provides a default value for a column if none is specified during insert

    +

    Check: Allows only particular values (like Balance >= 0)
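A sketch of a table definition exercising each of these constraints (table and column names are made up for illustration; note that older MySQL versions parse but ignore CHECK):

mysql -e "CREATE TABLE testdb.accounts (
  id INT PRIMARY KEY,                    -- unique, non-null, indexed by default
  customer_id INT,
  email VARCHAR(255) UNIQUE,             -- value must differ across rows
  country CHAR(2) NOT NULL,              -- null values rejected
  balance DECIMAL(10,2) DEFAULT 0 CHECK (balance >= 0),
  FOREIGN KEY (customer_id) REFERENCES testdb.customers(id)
);"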

  • Indexes

Most indexes use the B+ tree structure.

    -

    Why use them: Speeds up queries (in large tables that fetch only a few rows, min/max queries, by eliminating rows from consideration etc)

    -

    Types of indexes: unique, primary key, fulltext, secondary

    -

    Write-heavy loads, mostly full table scans or accessing large number of rows etc. do not benefit from indexes

    +

Why use them: Speeds up queries (in large tables that fetch only a few rows, min/max queries, by eliminating rows from consideration, etc.)

    +

    Types of indexes: unique, primary key, fulltext, secondary

    +

Write-heavy loads, mostly full table scans, queries that access a large number of rows, etc. do not benefit from indexes.
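For instance, a secondary index on a frequently filtered column might be created like this (names are illustrative, continuing the hypothetical accounts table):

mysql -e "CREATE INDEX idx_customer_id ON testdb.accounts (customer_id);"
mysql -e "EXPLAIN SELECT * FROM testdb.accounts WHERE customer_id = 42;"  # shows whether the index is used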

  • Joins

    @@ -2208,14 +2208,14 @@ Default: Provides a default value for a column if none is specified during inser
  • Access control

    -

    DBs have privileged accounts for admin tasks, and regular accounts for clients. There are finegrained controls on what actions(DDL, DML etc. discussed earlier )are allowed for these accounts.

    +

    DBs have privileged accounts for admin tasks, and regular accounts for clients. There are fine-grained controls on what actions (DDL, DML, etc. discussed earlier) are allowed for these accounts.

The DB first verifies the user credentials (authentication), and then examines whether this user is permitted to perform the request (authorization) by looking up this information in some internal tables.

    -

    Other controls include activity auditing that allows examining the history of actions done by a user, and resource limits which define the number of queries, connections etc. allowed.

    +

    Other controls include activity auditing that allows examining the history of actions done by a user, and resource limits which define the number of queries, connections, etc. allowed.

  • -

    Commercial, closed source - Oracle, Microsoft SQL Server, IBM DB2

    -

    Open source with optional paid support - MySQL, MariaDB, PostgreSQL

    +

    Commercial, closed source: Oracle, Microsoft SQL Server, IBM DB2

    +

    Open source with optional paid support: MySQL, MariaDB, PostgreSQL

    Individuals and small companies have always preferred open source DBs because of the huge cost associated with commercial software.

In recent times, even large organizations have moved away from commercial software to open source alternatives because of the flexibility and cost savings associated with them.

    Lack of support is no longer a concern because of the paid support available from the developer and third parties.

    diff --git a/level101/databases_sql/conclusion/index.html b/level101/databases_sql/conclusion/index.html index 1956849..79c1422 100644 --- a/level101/databases_sql/conclusion/index.html +++ b/level101/databases_sql/conclusion/index.html @@ -2151,7 +2151,7 @@

    Conclusion

    -

    We have covered basic concepts of SQL databases. We have also covered some of the tasks that an SRE may be responsible for - there is so much more to learn and do. We hope this course gives you a good start and inspires you to explore further.

    +

    We have covered basic concepts of SQL databases. We have also covered some of the tasks that an SRE may be responsible for—there is so much more to learn and do. We hope this course gives you a good start and inspires you to explore further.

    Further reading

    We will now try to understand what each command does and how to use these commands. You should also practice the given examples on the -online bash shell.

    -

    We will create a new file called "numbers.txt" and insert numbers from 1 +online Bash shell.

    +

We will create a new file called numbers.txt and insert numbers from 1 to 10 in this file. Each number will be on a separate line.
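One way to create this file, assuming a Bash shell with GNU coreutils:

seq 1 10 > numbers.txt   # writes 1 to 10, one number per line
cat numbers.txt          # verify the contents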

    grep

    -

    The grep command in its simplest form can be used to search particular +

The grep command in its simplest form can be used to search particular words in a text file. It will display all the lines in a file that contain a particular input. The word we want to search is provided as -an input to the grep command.

    -

    General syntax of using grep command:

    -
    grep <word_to_search> <file_name>
    +an input to the grep command.

    +

    General syntax of using grep command:

    +
    grep <word_to_search> <file_name>
     

    In this example, we are trying to search for a string "1" in this file. -The grep command outputs the lines where it found this string.

    +The grep command outputs the lines where it found this string.
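On the numbers.txt file created earlier, this looks roughly like:

grep 1 numbers.txt   # prints the lines containing "1"
1
10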

    sed

    -

    The sed command in its simplest form can be used to replace a text in a +

    The sed command in its simplest form can be used to replace a text in a file.

    -

    General syntax of using the sed command for replacement:

    -
    sed 's/<text_to_replace>/<replacement_text>/' <file_name>
    +

    General syntax of using the sed command for replacement:

    +
    sed 's/<text_to_replace>/<replacement_text>/' <file_name>
     

    Let's try to replace each occurrence of "1" in the file with "3" using -sed command.

    +sed command.

The content of the file will not change in the above -example. To do so, we have to use an extra argument '-i' so that the +example. To persist the changes, we have to use an extra argument -i so that the changes are reflected back in the file.
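A sketch on numbers.txt:

sed 's/1/3/' numbers.txt      # prints the file with the first "1" on each line replaced by "3"
sed -i 's/1/3/' numbers.txt   # -i writes the changes back to the file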

    sort

    -

    The sort command can be used to sort the input provided to it as an +

    The sort command can be used to sort the input provided to it as an argument. By default, it will sort in increasing order.

    Let's first see the content of the file before trying to sort it.

    -

    Now, we will try to sort the file using the sort command. The sort +

    Now, we will try to sort the file using the sort command. The sort command sorts the content in lexicographical order.
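A sketch on numbers.txt:

sort numbers.txt     # lexicographical order: 1, 10, 2, 3, ..., 9
sort -n numbers.txt  # -n sorts numerically instead: 1, 2, ..., 10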

    The content of the file will not change in the above @@ -2838,28 +2881,28 @@ example.

    I/O Redirection

Each open file gets assigned a file descriptor. A file descriptor is a unique identifier for open files in the system. There are always three -default files open, stdin (the keyboard), stdout (the screen), and -stderr (error messages output to the screen). These files can be +default files open, stdin (the keyboard), stdout (the screen), and +stderr (error messages output to the screen). These files can be redirected.

    -

    Everything is a file in linux - +

    Everything is a file in Linux - https://unix.stackexchange.com/questions/225537/everything-is-a-file

    Till now, we have displayed all the output on the screen which is the standard output. We can use some special operators to redirect the output of the command to files or even to the input of other commands. I/O redirection is a very powerful feature.

    -

    In the below example, we have used the '>' operator to redirect the -output of ls command to output.txt file.

    +

In the below example, we have used the > operator to redirect the +output of the ls command to the output.txt file.
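A minimal sketch of output redirection:

ls > output.txt   # stdout of ls goes into output.txt instead of the screen
cat output.txt    # the directory listing is now in the file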

    -

    In the below example, we have redirected the output from echo command to +

In the below example, we have redirected the output from the echo command to a file.

    We can also redirect the output of a command as an input to another command. This is possible with the help of pipes.

    -

    In the below example, we have passed the output of cat command as an -input to grep command using pipe(|) operator.

    +

In the below example, we have passed the output of the cat command as an +input to the grep command using the pipe (|) operator.

    -

    In the below example, we have passed the output of sort command as an -input to uniq command using pipe(|) operator. The uniq command only +

In the below example, we have passed the output of the sort command as an +input to the uniq command using the pipe (|) operator. The uniq command only prints the unique numbers from the input.
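A sketch of both pipelines on numbers.txt:

cat numbers.txt | grep 1    # output of cat becomes the input of grep
sort numbers.txt | uniq     # uniq drops adjacent duplicate lines from the sorted input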

    I/O redirection - diff --git a/level101/linux_basics/conclusion/index.html b/level101/linux_basics/conclusion/index.html index 7450a75..0254b4d 100644 --- a/level101/linux_basics/conclusion/index.html +++ b/level101/linux_basics/conclusion/index.html @@ -334,7 +334,7 @@

  • - Useful Courses and tutorials + Useful Courses and Tutorials
  • @@ -2146,7 +2146,7 @@
  • - Useful Courses and tutorials + Useful Courses and Tutorials
  • @@ -2165,7 +2165,7 @@

    Conclusion

    -

    We have covered the basics of Linux operating systems and basic commands used in linux. +

    We have covered the basics of Linux operating systems and basic commands used in Linux. We have also covered the Linux server administration commands.

    We hope that this course will make it easier for you to operate on the command line.

    Applications in SRE Role

    @@ -2176,12 +2176,12 @@ We have also covered the Linux server administration commands.

  • tail command is very useful to view the latest data in the log file.
• Different users will have different permissions depending on their roles. We will also not want everyone in the company to access our servers for security reasons. User permissions can be restricted with the chown, chmod and chgrp commands.
• ssh is one of the most frequently used commands for an SRE. Logging into servers and troubleshooting along with performing basic administration tasks will only be possible if we are able to log in to the server.
  • -
  • What if we want to run an apache server or nginx on a server? We will first install it using the package manager. Package management commands become important here.
  • -
  • Managing services on servers is another critical responsibility of a SRE. Systemd related commands can help in troubleshooting issues. If a service goes down, we can start it using systemctl start command. We can also stop a service in case it is not needed.
  • -
  • Monitoring is another core responsibility of a SRE. Memory and CPU are two important system level metrics which should be monitored. Commands like top and free are quite helpful here.
  • -
  • If a service is throwing an error, how do we find out the root cause of the error ? We will certainly need to check logs to find out the whole stack trace of the error. The log file will also tell us the number of times the error has occurred along with time when it started.
  • +
  • What if we want to run an Apache server or NGINX on a server? We will first install it using the package manager. Package management commands become important here.
  • +
• Managing services on servers is another critical responsibility of an SRE. systemd-related commands can help in troubleshooting issues. If a service goes down, we can start it using the systemctl start command. We can also stop a service in case it is not needed.
  • +
• Monitoring is another core responsibility of an SRE. Memory and CPU are two important system-level metrics which should be monitored. Commands like top and free are quite helpful here.
  • +
• If a service throws an error, how do we find out the root cause of the error? We will certainly need to check logs to find out the whole stack trace of the error. The log file will also tell us the number of times the error has occurred along with the time when it started.
  • -

    Useful Courses and tutorials

    +

    Useful Courses and Tutorials

    • Edx basic linux commands course
    • Edx Red Hat Enterprise Linux Course
    • diff --git a/level101/linux_basics/intro/index.html b/level101/linux_basics/intro/index.html index d12edb6..c528a77 100644 --- a/level101/linux_basics/intro/index.html +++ b/level101/linux_basics/intro/index.html @@ -2306,7 +2306,7 @@

      Introduction

      Prerequisites

        -
      • Should be comfortable in using any operating systems like Windows, Linux or Mac
      • +
• Should be comfortable in using any operating system like Windows, Linux or macOS
      • Expected to have fundamental knowledge of operating systems

      What to expect from this course

      @@ -2316,13 +2316,13 @@ Linux distributions and uses of Linux operating systems. We will also talk about difference between GUI and CLI.

      In the second part, we cover some basic commands used in Linux. We will focus on commands used for navigating the file system, viewing and manipulating files, -I/O redirection etc.

      -

      In the third part, we cover Linux system administration. This includes day to day tasks +I/O redirection, etc.

      +

In the third part, we cover Linux system administration. This includes day-to-day tasks performed by Linux admins, like managing users/groups, managing file permissions, monitoring system performance, log files, etc.

      -

      In the second and third part, we will be taking examples to understand the concepts.

      +

      In the second and third part, we will be showing examples to understand the concepts.

      What is not covered under this course

      -

      We are not covering advanced Linux commands and bash scripting in this +

      We are not covering advanced Linux commands and Bash scripting in this course. We will also not be covering Linux internals.

      Course Contents

The following topics have been covered in this course:

      @@ -2370,26 +2370,26 @@ course. We will also not be covering Linux internals.

Most of us are familiar with the Windows operating system, used in more than 75% of personal computers. The Windows operating systems are based on the Windows NT kernel.

      -

      A kernel is the most important part of -an operating system - it performs important functions like process -management, memory management, filesystem management etc.

      -

      Linux operating systems are based on the Linux kernel. A Linux based +

      A kernel is the most important part of +an operating system—it performs important functions like process +management, memory management, filesystem management, etc.

      +

      Linux operating systems are based on the Linux kernel. A Linux-based operating system will consist of Linux kernel, GUI/CLI, system libraries and system utilities. The Linux kernel was independently developed and -released by Linus Torvalds. The Linux kernel is free and open-source - -https://github.com/torvalds/linux

      -

      Linux is a kernel and not a complete operating system. Linux kernel is combined with GNU system to make a complete operating system. Therefore, linux based operating systems are also called as GNU/Linux systems. GNU is an extensive collection of free softwares like compiler, debugger, C library etc. -Linux and the GNU System

      +released by Linus Torvalds. The Linux kernel is free and open-source (See +https://github.com/torvalds/linux).

      +

Linux is a kernel and not a complete operating system. The Linux kernel is combined with the GNU system to make a complete operating system. Therefore, Linux-based operating systems are also called GNU/Linux systems. GNU is an extensive collection of free software like compiler, debugger, C library, etc. (See +Linux and the GNU System)

      History of Linux - https://en.wikipedia.org/wiki/History_of_Linux

      -

      A Linux distribution(distro) is an operating system based on +

A Linux distribution (distro) is an operating system based on the Linux kernel and a package management system. A package management system consists of tools that help in installing, upgrading, configuring and removing software on the operating system.

Software is usually adapted to a distribution and packaged in a -distro specific format. These packages are available through a distro -specific repository. Packages are installed and managed in the operating +distro-specific format. These packages are available through a distro-specific +repository. Packages are installed and managed in the operating system by a package manager.

      List of popular Linux distributions:

        @@ -2425,12 +2425,12 @@ system by a package manager.

Debian style (.deb)    Debian, Ubuntu    APT
Red Hat style (.rpm)    Fedora, CentOS, Red Hat Enterprise Linux    YUM
@@ -2465,13 +2465,13 @@ system by a package manager.

        Mobile phones - Android is based on Linux operating system

      • -

        Embedded devices - watches, televisions, traffic lights etc

        +

        Embedded devices - watches, televisions, traffic lights, etc.

      • Satellites

      • -

        Network devices - routers, switches etc.

        +

        Network devices - routers, switches, etc.

      Graphical user interface (GUI) vs Command line interface (CLI)

      @@ -2493,9 +2493,9 @@ example of a CLI (command line interface). Bash is one of the most popular shell programs available on Linux servers. Other popular shell programs are zsh, ksh and tcsh.

      Terminal is a program that opens a window and lets you interact with the -shell. Some popular examples of terminals are gnome-terminal, xterm, -konsole etc.

      -

      Linux users do use the terms shell, terminal, prompt, console etc. +shell. Some popular examples of terminals are GNOME-terminal, xterm, +Konsole, etc.

      +

      Linux users do use the terms shell, terminal, prompt, console, etc. interchangeably. In simple terms, these all refer to a way of taking commands from the user.

      diff --git a/level101/linux_basics/linux_server_administration/index.html b/level101/linux_basics/linux_server_administration/index.html index f8d28a9..6ad21a3 100644 --- a/level101/linux_basics/linux_server_administration/index.html +++ b/level101/linux_basics/linux_server_administration/index.html @@ -2709,7 +2709,7 @@

      Linux Server Administration

      -

      In this course will try to cover some of the common tasks that a linux +

In this course, we will try to cover some of the common tasks that a Linux server administrator performs. We will first try to understand what a particular command does and then try to understand the commands using examples. Do keep in mind that it's very important to practice the Linux @@ -2717,7 +2717,7 @@ commands on your own.

      Lab Environment Setup

      Multi-User Operating Systems

      -

      An operating system is considered as multi-user if it allows multiple people/users to use a computer and not affect each other's files and preferences. Linux based operating systems are multi-user in nature as it allows multiple users to access the system at the same time. A typical computer will only have one keyboard and monitor but multiple users can log in via SSH if the computer is connected to the network. We will cover more about SSH later.

      +

An operating system is considered multi-user if it allows multiple people/users to use a computer without affecting each other's files and preferences. Linux-based operating systems are multi-user in nature as they allow multiple users to access the system at the same time. A typical computer will only have one keyboard and monitor, but multiple users can log in via SSH if the computer is connected to the network. We will cover more about SSH later.

      As a server administrator, we are mostly concerned with the Linux servers which are physically present at a very large distance from us. We can connect to these servers with the help of remote login methods like SSH.

Since Linux supports multiple users, we need to have a method which can protect the users from each other. One user should not be able to access and modify the files of other users.

      User/Group Management

      @@ -2747,25 +2747,29 @@ commands on your own.

    id command

    -

    id command can be used to find the uid and gid associated with an user. +

id command can be used to find the uid and gid associated with a user. +It also lists down the groups to which the user belongs.

    -

    The uid and gid associated with the root user is 0. -

    -

    A good way to find out the current user in Linux is to use the whoami +

    The uid and gid associated with the root user is 0.

    +

    +

    A good way to find out the current user in Linux is to use the whoami command.

    -

    "root" user or superuser is the most privileged user with +

The root user or superuser is the most privileged user with +unrestricted access to all the resources on the system. It has UID 0.
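A quick sketch of these commands (exact group lists will vary by system):

id root    # uid=0(root) gid=0(root) groups=0(root)
whoami     # prints the current user's name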

    Important files associated with users/groups


    -

    If you want to understand each filed discussed in the above outputs, you can go +

If you want to understand each field discussed in the above outputs, you can go through the below links:

    • @@ -2792,25 +2796,17 @@ through below links:

Some of the commands which are frequently used to manage users/groups on Linux are the following:

        -
      • -

        useradd - Creates a new user

        -
      • -
      • -

        passwd - Adds or modifies passwords for a user

        -
      • -
      • -

        usermod - Modifies attributes of an user

        -
      • -
      • -

        userdel - Deletes an user

        -
      • +
      • useradd - Creates a new user
      • +
      • passwd - Adds or modifies passwords for a user
      • +
• usermod - Modifies attributes of a user
      • +
• userdel - Deletes a user

      useradd

      -

      The useradd command adds a new user in Linux.

      -

      We will create a new user 'shivam'. We will also verify that the user -has been created by tailing the /etc/passwd file. The uid and gid are +

      The useradd command adds a new user in Linux.

      +

      We will create a new user shivam. We will also verify that the user +has been created by tailing the /etc/passwd file. The uid and gid are 1000 for the newly created user. The home directory assigned to the user -is /home/shivam and the login shell assigned is /bin/bash. Do note that +is /home/shivam and the login shell assigned is /bin/bash. Do note that the user home directory and login shell can be modified later on.

      If we do not specify any value for attributes like home directory or @@ -2818,13 +2814,13 @@ login shell, default values will be assigned to the user. We can also override these default values when creating a new user.
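A sketch of the flow described above (the uid/gid values and fields depend on the system defaults):

useradd shivam
tail -1 /etc/passwd   # e.g. shivam:x:1000:1000::/home/shivam:/bin/bash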

      passwd

      -

      The passwd command is used to create or modify passwords for a user.

      +

      The passwd command is used to create or modify passwords for a user.

      In the above examples, we have not assigned any password for users -'shivam' or 'amit' while creating them.

      -

      "!!" in an account entry in shadow means the account of an user has +shivam or amit while creating them.

      +

!! in an account entry in shadow means the account of a user has been created, but not yet given a password.

      -

      Let's now try to create a password for user "shivam".

      +

      Let's now try to create a password for user shivam.
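Roughly:

passwd shivam   # prompts twice for the new password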

Do remember the password, as we will later be using it in examples where it will be useful.

      @@ -2833,112 +2829,116 @@ from a normal user to root user, it will request you for a password. Also, when you login using root user, the password will be asked.

      usermod

      -

      The usermod command is used to modify the attributes of an user like the +

      The usermod command is used to modify the attributes of an user like the home directory or the shell.

      -

      Let's try to modify the login shell of user "amit" to "/bin/bash".

      +

      Let's try to modify the login shell of user amit to /bin/bash.
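A sketch:

usermod --shell /bin/bash amit   # set amit's login shell
grep amit /etc/passwd            # verify the shell field changed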

      In a similar way, you can also modify many other attributes for a user. -Try 'usermod -h' for a list of attributes you can modify.

      +Try usermod -h for a list of attributes you can modify.

      userdel

      -

      The userdel command is used to remove a user on Linux. Once we remove a +

      The userdel command is used to remove a user on Linux. Once we remove a user, all the information related to that user will be removed.

      -

      Let's try to delete the user "amit". After deleting the user, you will -not find the entry for that user in "/etc/passwd" or "/etc/shadow" file.

      +

      Let's try to delete the user amit. After deleting the user, you will +not find the entry for that user in /etc/passwd or /etc/shadow file.
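A sketch:

userdel amit
grep amit /etc/passwd   # prints nothing; the entry is gone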

      Important commands for managing groups

      -

      Commands for managing groups are quite similar to the commands used for managing users. Each command is not explained in detail here as they are quite similar. You can try running these commands on your system.

      +

      Commands for managing groups are quite similar to the commands used for managing users. Each command is not explained in detail here as they are quite similar. You can try running these commands on your system.

    /etc/passwdStores the user name, the uid, the gid, the home directory, the login shell etcFilesDescription
    /etc/passwdStores the user name, the uid, the gid, the home directory, the login shell etc
    /etc/shadow Stores the password associated with the users
    - - + + - + + + + + - + - +
Command    Description
groupadd <group_name>    Creates a new group
groupmod <group_name>    Modifies attributes of a group
groupdel <group_name>    Deletes a group
gpasswd <group_name>    Modifies password for group

    -

    We will now try to add user "shivam" to the group we have created above.

    +

    We will now try to add user shivam to the group we have created above.
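A sketch, using a hypothetical group named dev:

groupadd dev            # create the group
gpasswd -a shivam dev   # add shivam to it
id shivam               # dev should now appear in the group list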

    Becoming a Superuser

    Before running the below commands, do make sure that you have set up a -password for user "shivam" and user "root" using the passwd command +password for user shivam and user root using the passwd command described in the above section.

    -

    The su command can be used to switch users in Linux. Let's now try to -switch to user "shivam".

    +

    The su command can be used to switch users in Linux. Let's now try to +switch to user shivam.

    -

    Let's now try to open the "/etc/shadow" file.

    +

    Let's now try to open the /etc/shadow file.
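Roughly what this looks like:

su shivam
cat /etc/shadow   # cat: /etc/shadow: Permission denied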

    -

    The operating system didn't allow the user "shivam" to read the content -of the "/etc/shadow" file. This is an important file in Linux which -stores the passwords of users. This file can only be accessed by root or -users who have the superuser privileges.

    -

    The sudo command allows a user to run commands with the security +

    The operating system didn't allow the user shivam to read the content +of the /etc/shadow file. This is an important file in Linux which +stores the passwords of users. This file can only be accessed by root or +users who have the superuser privileges.

    +

The sudo command allows a user to run commands with the security privileges of the root user. Do remember that the root user has all the privileges on a system. We can also use the su command to switch to the root user and open the above file, but doing that will require the password of the root user. An alternative way, which is preferred on most modern operating systems, is to use the sudo command for becoming a superuser. This way, a user has to enter his/her own password and they need to be a part of the sudo group.

How to provide superprivileges to other users?

    -

    Let's first switch to the root user using su command. Do note that using +

    Let's first switch to the root user using su command. Do note that using the below command will need you to enter the password for the root user.

    -

In case you forgot to set a password for the root user, type exit and +you will be back as the root user. Now, set up a password using the -passwd command.

    In case, you forgot to set a password for the root user, type exit and you will be back as the root user. Now, set up a password using the -passwd command.

    -

    The file /etc/sudoers holds the names of users permitted to invoke -sudo. In redhat operating systems, this file is not present by -default. We will need to install sudo.

    +passwd command.

    +

    The file /etc/sudoers holds the names of users permitted to invoke +sudo. In Red Hat operating systems, this file is not present by +default. We will need to install sudo.

    -

    We will discuss the yum command in detail in later sections.

    -

    Try to open the "/etc/sudoers" file on the system. The file has a lot of +

    We will discuss the yum command in detail in later sections.

    +

    Try to open the /etc/sudoers file on the system. The file has a lot of information. This file stores the rules that users must follow when -running the sudo command. For example, root is allowed to run any +running the sudo command. For example, root is allowed to run any commands from anywhere.

    One easy way of providing root access to users is to add them to a group -which has permissions to run all the commands. "wheel" is a group in -redhat Linux with such privileges.

    +which has permissions to run all the commands. wheel is a group in +Red Hat Linux with such privileges.

    -

    Let's add the user "shivam" to this group so that it also has sudo +

    Let's add the user shivam to this group so that it also has sudo privileges.

    -

    Let's now switch back to user "shivam" and try to access the -"/etc/shadow" file.

    +

    Let's now switch back to user shivam and try to access the +/etc/shadow file.

    -

    We need to use sudo before running the command since it can only be -accessed with the sudo privileges. We have already given sudo privileges -to user “shivam” by adding him to the group “wheel”.

    +

    We need to use sudo before running the command since it can only be +accessed with the sudo privileges. We have already given sudo privileges +to user shivam by adding him to the group wheel.
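Putting the above together, a sketch (run the first command as root):

usermod -aG wheel shivam   # -aG appends shivam to the wheel group
su - shivam
sudo cat /etc/shadow       # works after shivam enters his own password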

    File Permissions

    On a Linux operating system, each file and directory is assigned access permissions for the owner of the file, the members of a group of related users and everybody else. This is to make sure that one user is not allowed to access the files and resources of another user.

    -

    To see the permissions of a file, we can use the ls command. Let's look -at the permissions of /etc/passwd file.

    +

    To see the permissions of a file, we can use the ls command. Let's look +at the permissions of /etc/passwd file.

    Let's go over some of the important fields in the output that are related to file permissions.

    Chmod command

    -

    The chmod command is used to modify files and directories permissions in +

    The chmod command is used to modify files and directories permissions in Linux.

    -

    The chmod command accepts permissions in as a numerical argument. We can +

The chmod command accepts permissions as a numerical argument. We can think of permissions as a series of bits, with 1 representing True or allowed and 0 representing False or not allowed.

    @@ -3004,56 +3004,56 @@ allowed and 0 representing False or not allowed.

    We will now create a new file and check the permission of the file.

    The group owner doesn't have the permission to write to this file. Let's -give the group owner or root the permission to write to it using chmod +give the group owner or root the permission to write to it using chmod command.
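A sketch, assuming the new file is named test.txt:

chmod 664 test.txt   # rw- for owner and group, r-- for others
ls -l test.txt       # verify the new mode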

    -

    Chmod command can be also used to change the permissions of a directory +

chmod command can also be used to change the permissions of a directory +in a similar way.

    Chown command

    -

    The chown command is used to change the owner of files or +

    The chown command is used to change the owner of files or directories in Linux.

    -

    Command syntax: chown \<new_owner> \<file_name>

    +

Command syntax: chown <new_owner> <file_name>

    -

    In case, we do not have sudo privileges, we need to use sudo -command. Let's switch to user 'shivam' and try changing the owner. We -have also changed the owner of the file to root before running the below +

In case we do not have sudo privileges, we need to use the sudo +command. Let's switch to user shivam and try changing the owner. We +have also changed the owner of the file to root before running the below +command.

Chown command can also be used to change the owner of a directory in a similar way.

    Chgrp command

    -

    The chgrp command can be used to change the group ownership of files or -directories in Linux. The syntax is very similar to that of chown +

    The chgrp command can be used to change the group ownership of files or +directories in Linux. The syntax is very similar to that of chown command.

    -

    Chgrp command can also be used to change the owner of a directory in the +

chgrp command can also be used to change the group of a directory in a +similar way.
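A combined sketch of both commands on the hypothetical test.txt:

sudo chown root test.txt    # change the owner to root
sudo chgrp wheel test.txt   # change the group ownership
ls -l test.txt              # owner and group columns now show root and wheel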

    SSH Command

    -

    The ssh command is used for logging into the remote systems, transfer files between systems and for executing commands on a remote machine. SSH stands for secure shell and is used to provide an encrypted secured connection between two hosts over an insecure network like the internet.

    +

The ssh command is used for logging into remote systems, transferring files between systems, and executing commands on a remote machine. SSH stands for secure shell and is used to provide an encrypted secured connection between two hosts over an insecure network like the internet.

    Reference: https://www.ssh.com/ssh/command/

    We will now discuss passwordless authentication which is secure and most -commonly used for ssh authentication.

    +commonly used for ssh authentication.

    Passwordless Authentication Using SSH

    -

    Using this method, we can ssh into hosts without entering the password. +

    Using this method, we can ssh into hosts without entering the password. This method is also useful when we want some scripts to perform ssh-related tasks.

    Passwordless authentication requires the use of a public and private key pair. As the name implies, the public key can be shared with anyone but the private key should be kept private. -Lets not get into the details of how this authentication works. You can read more about it +Let's not get into the details of how this authentication works. You can read more about it here

Steps for setting up passwordless authentication with a remote host (a consolidated example follows these steps):

    1. Generating public-private key pair

      -

      If we already have a key pair stored in \~/.ssh directory, we will not need to generate keys again.

      -

      Install openssh package which contains all the commands related to ssh.

      +

      If we already have a key pair stored in ~/.ssh directory, we will not need to generate keys again.

      +

      Install openssh package which contains all the commands related to ssh.

      -

      Generate a key pair using the ssh-keygen command. One can choose the +

      Generate a key pair using the ssh-keygen command. One can choose the default values for all prompts.

      -

      After running the ssh-keygen command successfully, we should see two -keys present in the \~/.ssh directory. Id_rsa is the private key and -id_rsa.pub is the public key. Do note that the private key can only be +

      After running the ssh-keygen command successfully, we should see two +keys present in the ~/.ssh directory. id_rsa is the private key and +id_rsa.pub is the public key. Do note that the private key can only be read and modified by you.

    2. @@ -3061,26 +3061,30 @@ read and modified by you.

      Transferring the public key to the remote host

      There are multiple ways to transfer the public key to the remote server. We will look at one of the most common ways of doing it using the -ssh-copy-id command.

      +ssh-copy-id command.

      -

      Install the openssh-clients package to use ssh-copy-id command.

      +

      Install the openssh-clients package to use ssh-copy-id command.

      -

      Use the ssh-copy-id command to copy your public key to the remote host.

      +

      Use the ssh-copy-id command to copy your public key to the remote host.

      -

      Now, ssh into the remote host using the password authentication.

      +

      Now, ssh into the remote host using the password authentication.

      -

      Our public key should be there in \~/.ssh/authorized_keys now.

      +

      Our public key should be there in ~/.ssh/authorized_keys now.

      -

      \~/.ssh/authorized_key contains a list of public keys. The users -associated with these public keys have the ssh access into the remote +

~/.ssh/authorized_keys contains a list of public keys. The users +associated with these public keys have ssh access to the remote +host.
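Putting the above steps together (the remote host name is illustrative):

ssh-keygen                       # accept the defaults; creates ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub
ssh-copy-id shivam@remote-host   # asks for the password once and appends the key to authorized_keys
ssh shivam@remote-host           # subsequent logins need no password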

How to run commands on a remote host?

    -

    General syntax: ssh \<user>@\<hostname/hostip> \<command>

    +

    General syntax:

    +
ssh <user>@<hostname/hostip> <command>
    +
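For example (host name illustrative):

ssh shivam@remote-host uptime   # runs uptime on the remote host and prints the result locally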

How to transfer files from one host to another?

    -

    General syntax: scp \<source> \<destination>

    +

    General syntax:

    +
scp <source> <destination>
    +
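For example (host name illustrative):

scp output.txt shivam@remote-host:/home/shivam/    # copy a local file to a remote path
scp shivam@remote-host:/home/shivam/output.txt .   # copy a remote file to the current directory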

    Package Management

    Package management is the process of installing and managing software on @@ -3096,11 +3100,11 @@ systems.

    - + - + @@ -3115,110 +3119,110 @@ systems.

    - + - - + + - + - +
Debian style (.deb)    Debian, Ubuntu
Red Hat style (.rpm)    Fedora, CentOS, Red Hat Enterprise Linux
yum install <package_name>    Installs a package on your system
yum update <package_name>    Updates a package to its latest available version
yum remove <package_name>    Removes a package from your system
yum search <keyword>    Searches for a particular keyword

DNF is the successor to YUM and is now used in Fedora for installing and -managing packages. DNF may replace YUM in the future on all RPM-based Linux distributions.

    -

    We did find an exact match for the keyword httpd when we searched using -yum search command. Let's now install the httpd package.

    +

    We did find an exact match for the keyword httpd when we searched using +yum search command. Let's now install the httpd package.

    -

    After httpd is installed, we will use the yum remove command to remove -httpd package.

    +

    After httpd is installed, we will use the yum remove command to remove +httpd package.
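The full cycle sketched as commands:

yum search httpd        # find the package
sudo yum install httpd  # install it
sudo yum remove httpd   # remove it again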

    Process Management

    In this section, we will study about some useful commands that can be used to monitor the processes on Linux systems.

    ps (process status)

    -

    The ps command is used to know the information of a process or list of +

    The ps command is used to know the information of a process or list of processes.

    -

    If you get an error "ps command not found" while running ps command, do -install procps package.

    -

    ps without any arguments is not very useful. Let's try to list all the +

If you get an error "ps command not found" while running the ps command, install the procps package.

    +

    ps without any arguments is not very useful. Let's try to list all the processes on the system by using the below command.

    Reference: https://unix.stackexchange.com/questions/106847/what-does-aux-mean-in-ps-aux

    -

    We can use an additional argument with ps command to list the -information about the process with a specific process ID.

    +

    We can use an additional argument with ps command to list the +information about the process with a specific process ID (PID).

    -

    We can use grep in combination with ps command to list only specific +

    We can use grep in combination with ps command to list only specific processes.
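A sketch of the three variants:

ps aux               # all processes with owner, CPU and memory usage
ps -p 1              # only the process with PID 1
ps aux | grep sshd   # filter the listing for a specific process name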

    top

    -

    The top command is used to show information about Linux processes +

    The top command is used to show information about Linux processes running on the system in real time. It also shows a summary of the system information.

    -

    For each process, top lists down the process ID, owner, priority, state, -cpu utilization, memory utilization and much more information. It also -lists down the memory utilization and cpu utilization of the system as a -whole along with system uptime and cpu load average.

    +

    For each process, top lists down the process ID, owner, priority, state, +CPU utilization, memory utilization and much more information. It also +lists down the memory utilization and CPU utilization of the system as a +whole along with system uptime and CPU load average.
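top is interactive; for scripts, a batch-mode snapshot can be taken instead:

top                       # press q to quit
top -b -n 1 | head -15    # one batch-mode iteration, first 15 lines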

    Memory Management

    In this section, we will study about some useful commands that can be used to view information about the system memory.

    free

    -

    The free command is used to display the memory usage of the system. The +

    The free command is used to display the memory usage of the system. The command displays the total free and used space available in the RAM along with space occupied by the caches/buffers.

    -

    free command by default shows the memory usage in kilobytes. We can use +

    free command by default shows the memory usage in kilobytes. We can use an additional argument to get the data in human-readable format.

    vmstat

    -

    The vmstat command can be used to display the memory usage along with -additional information about io and cpu usage.

    +

    The vmstat command can be used to display the memory usage along with +additional information about IO and CPU usage.
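A sketch of both commands:

free -h      # totals in human-readable units
vmstat 1 5   # five one-second samples of memory, IO and CPU statistics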

    Checking Disk Space

    In this section, we will study about some useful commands that can be used to view disk space on Linux.

    df (disk free)

    -

    The df command is used to display the free and available space for each +

    The df command is used to display the free and available space for each mounted file system.

    du (disk usage)

    -

    The du command is used to display disk usage of files and directories on +

    The du command is used to display disk usage of files and directories on the system.

    The below command can be used to display the top 5 largest directories -in the root directory.

    +in the root directory.
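One common incantation for this (GNU coreutils; -x keeps du on one filesystem):

sudo du -xh --max-depth=1 / | sort -rh | head -5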

    Daemons

    -

    A computer program that runs as a background process is called a daemon. -Traditionally, the name of daemon processes ended with d - sshd, httpd +

A computer program that runs as a background process is called a daemon. +Traditionally, the names of daemon processes end with d - sshd, httpd, etc. We cannot interact with daemon processes as they run in the background.

    Services and daemons are used interchangeably most of the time.

    Systemd

    -

    Systemd is a system and service manager for Linux operating systems. -Systemd units are the building blocks of systemd. These units are +

    systemd is a system and service manager for Linux operating systems. +systemd units are the building blocks of systemd. These units are represented by unit configuration files.

The below example shows the unit configuration files available at /usr/lib/systemd/system which are distributed by installed RPM packages. We are more interested in the configuration files that end with .service, as these are service units.

    Managing System Services

    -

    Service units end with .service file extension. Systemctl command can be -used to start/stop/restart the services managed by systemd.

    +

    Service units end with .service file extension. systemctl command can be +used to start/stop/restart the services managed by systemd.
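A sketch, using sshd as an illustrative service:

systemctl status sshd          # current state of the service
sudo systemctl start sshd
sudo systemctl stop sshd
sudo systemctl restart sshd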

    diff --git a/level101/linux_networking/conclusion/index.html b/level101/linux_networking/conclusion/index.html index 4e76e93..947e449 100644 --- a/level101/linux_networking/conclusion/index.html +++ b/level101/linux_networking/conclusion/index.html @@ -2107,13 +2107,13 @@

    Conclusion

    -

    With this we have traversed through the TCP/IP stack completely. We hope there will be a different perspective when one opens any website in the browser post the course.

    +

With this, we have traversed the TCP/IP stack completely. We hope you will have a different perspective when you open any website in the browser after this course.

During the course, we have also dissected the common tasks in this pipeline which fall under the ambit of an SRE.

    Post Training Exercises

      -
-
1. Setup own DNS resolver in the dev environment which acts as an authoritative DNS server for example.com and forwarder for other domains. Update resolv.conf to use the new DNS resolver running in localhost
2. Set up a site dummy.example.com in localhost and run a webserver with a self signed certificate. Update the trusted CAs or pass self signed CA’s public key as a parameter so that curl https://dummy.example.com -v works properly without self signed cert warning
3. Update the routing table to use another host(container/VM) in the same network as a gateway for 8.8.8.8/32 and run ping 8.8.8.8. Do the packet capture on the new gateway to see L3 hop is working as expected(might need to disable icmp_redirect)
+
1. Set up your own DNS resolver in the dev environment which acts as an authoritative DNS server for example.com and a forwarder for other domains. Update resolv.conf to use the new DNS resolver running in localhost.
2. Set up a site dummy.example.com in localhost and run a webserver with a self-signed certificate. Update the trusted CAs or pass the self-signed CA’s public key as a parameter so that curl https://dummy.example.com -v works properly without a self-signed cert warning.
3. Update the routing table to use another host (container/VM) in the same network as a gateway for 8.8.8.8/32 and run ping 8.8.8.8. Do the packet capture on the new gateway to see that the L3 hop is working as expected (might need to disable icmp_redirect).
    diff --git a/level101/linux_networking/dns/index.html b/level101/linux_networking/dns/index.html index e312415..2cff576 100644 --- a/level101/linux_networking/dns/index.html +++ b/level101/linux_networking/dns/index.html @@ -2151,12 +2151,12 @@

    DNS

    -

    Domain Names are the simple human-readable names for websites. The Internet understands only IP addresses, but since memorizing incoherent numbers is not practical, domain names are used instead. These domain names are translated into IP addresses by the DNS infrastructure. When somebody tries to open www.linkedin.com in the browser, the browser tries to convert www.linkedin.com to an IP Address. This process is called DNS resolution. A simple pseudocode depicting this process looks this

    +

Domain names are the simple human-readable names for websites. The Internet understands only IP addresses, but since memorizing incoherent numbers is not practical, domain names are used instead. These domain names are translated into IP addresses by the DNS infrastructure. When somebody tries to open www.linkedin.com in the browser, the browser tries to convert www.linkedin.com to an IP address. This process is called DNS resolution. A simple pseudocode depicting this process looks like this:

    ip, err = getIPAddress(domainName)
     if err:
    -  print(“unknown Host Exception while trying to resolve:%s”.format(domainName))
    +  print("unknown Host Exception while trying to resolve:%s".format(domainName))
     
    -

    Now let’s try to understand what happens inside the getIPAddress function. The browser would have a DNS cache of its own where it checks if there is a mapping for the domainName to an IP Address already available, in which case the browser uses that IP address. If no such mapping exists, the browser calls gethostbyname syscall to ask the operating system to find the IP address for the given domainName

    +

    Now, let’s try to understand what happens inside the getIPAddress function. The browser would have a DNS cache of its own where it checks if there is a mapping for the domainName to an IP Address already available, in which case the browser uses that IP address. If no such mapping exists, the browser calls gethostbyname syscall to ask the operating system to find the IP address for the given domainName.

    def getIPAddress(domainName):
         resp, fail = lookupCache(domainName)
if not fail:
    @@ -2168,19 +2168,19 @@ if err:
            else:
               return resp
     
    -

    Now lets understand what operating system kernel does when the gethostbyname function is called. The Linux operating system looks at the file /etc/nsswitch.conf file which usually has a line

    +

Now, let's understand what the operating system kernel does when the gethostbyname function is called. The Linux operating system looks at the file /etc/nsswitch.conf, which usually has a line:

    hosts:      files dns
     
    -

    This line means the OS has to look up first in file (/etc/hosts) and then use DNS protocol to do the resolution if there is no match in /etc/hosts.

    -

    The file /etc/hosts is of format

    +

    This line means the OS has to look up first in file (/etc/hosts) and then use DNS protocol to do the resolution if there is no match in /etc/hosts.

    +

The file /etc/hosts is of the format:

    IPAddress FQDN [FQDN].*

    127.0.0.1 localhost.localdomain localhost
     ::1 localhost.localdomain localhost
     
    -

    If a match exists for a domain in this file then that IP address is returned by the OS. Lets add a line to this file

    +

    If a match exists for a domain in this file, then that IP address is returned by the OS. Let's add a line to this file:

    127.0.0.1 test.linkedin.com
     
    -

    And then do ping test.linkedin.com

    +

    And then do ping test.linkedin.com.

    ping test.linkedin.com -n
     
    PING test.linkedin.com (127.0.0.1) 56(84) bytes of data.
    @@ -2189,11 +2189,11 @@ if err:
     64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.037 ms
     
     
    -

    As mentioned earlier, if no match exists in /etc/hosts, the OS tries to do a DNS resolution using the DNS protocol. The linux system makes a DNS request to the first IP in /etc/resolv.conf. If there is no response, requests are sent to subsequent servers in resolv.conf. These servers in resolv.conf are called DNS resolvers. The DNS resolvers are populated by DHCP or statically configured by an administrator. +

As mentioned earlier, if no match exists in /etc/hosts, the OS tries to do a DNS resolution using the DNS protocol. The Linux system makes a DNS request to the first IP in /etc/resolv.conf. If there is no response, requests are sent to subsequent servers in resolv.conf. These servers in resolv.conf are called DNS resolvers. The DNS resolvers are populated by DHCP or statically configured by an administrator. Dig is a userspace DNS tool which creates and sends requests to DNS resolvers and prints the response it receives to the console.

    -
    #run this command in one shell to capture all DNS requests
    +
    # run this command in one shell to capture all DNS requests
     sudo tcpdump -s 0 -A -i any port 53
    -#make a dig request from another shell
    +# make a dig request from another shell
     dig linkedin.com
     
    13:19:54.432507 IP 172.19.209.122.56497 > 172.23.195.101.53: 527+ [1au] A? linkedin.com. (41)
    @@ -2203,21 +2203,21 @@ dig linkedin.com
     
     ..)........
     
    -

    The packet capture shows a request is made to 172.23.195.101:53 (this is the resolver in /etc/resolv.conf) for linkedin.com and a response is received from 172.23.195.101 with the IP address of linkedin.com 108.174.10.10

    -

    Now let's try to understand how DNS resolver tries to find the IP address of linkedin.com. DNS resolver first looks at its cache. Since many devices in the network can query for the domain name linkedin.com, the name resolution result may already exist in the cache. If there is a cache miss, it starts the DNS resolution process. The DNS server breaks “linkedin.com” to “.”, “com.” and “linkedin.com.” and starts DNS resolution from “.”. The “.” is called root domain and those IPs are known to the DNS resolver software. DNS resolver queries the root domain nameservers to find the right top-level domain (TLD) nameservers which could respond regarding details for "com.". The address of the TLD nameserver of “com.” is returned. Now the DNS resolution service contacts the TLD nameserver for “com.” to fetch the authoritative nameserver for “linkedin.com”. Once an authoritative nameserver of “linkedin.com” is known, the resolver contacts Linkedin’s nameserver to provide the IP address of “linkedin.com”. This whole process can be visualized by running the following -

    +

    The packet capture shows a request is made to 172.23.195.101:53 (this is the resolver in /etc/resolv.conf) for linkedin.com and a response is received from 172.23.195.101 with the IP address of linkedin.com 108.174.10.10.

    +

    Now, let's try to understand how DNS resolver tries to find the IP address of linkedin.com. DNS resolver first looks at its cache. Since many devices in the network can query for the domain name linkedin.com, the name resolution result may already exist in the cache. If there is a cache miss, it starts the DNS resolution process. The DNS server breaks “linkedin.com” to “.”, “com.” and “linkedin.com.” and starts DNS resolution from “.”. The “.” is called root domain and those IPs are known to the DNS resolver software. DNS resolver queries the root domain nameservers to find the right top-level domain (TLD) nameservers which could respond regarding details for "com.". The address of the TLD nameserver of “com.” is returned. Now the DNS resolution service contacts the TLD nameserver for “com.” to fetch the authoritative nameserver for “linkedin.com”. Once an authoritative nameserver of “linkedin.com” is known, the resolver contacts LinkedIn’s nameserver to provide the IP address of “linkedin.com”. This whole process can be visualized by running the following:

    dig +trace linkedin.com
     
    linkedin.com.       3600    IN  A   108.174.10.10
     
    -

    This DNS response has 5 fields where the first field is the request and the last field is the response. The second field is the Time to Live which says how long the DNS response is valid in seconds. In this case this mapping of linkedin.com is valid for 1 hour. This is how the resolvers and application(browser) maintain their cache. Any request for linkedin.com beyond 1 hour will be treated as a cache miss as the mapping has expired its TTL and the whole process has to be redone. +

This DNS response has 5 fields, where the first field is the request and the last field is the response. The second field is the Time-to-Live (TTL), which says how long the DNS response is valid, in seconds. In this case, this mapping of linkedin.com is valid for 1 hour. This is how the resolvers and the application (browser) maintain their cache. Any request for linkedin.com beyond 1 hour will be treated as a cache miss, as the mapping has expired its TTL, and the whole process has to be redone. The fourth field indicates the type of DNS request/response. Some of the various DNS query types are A, AAAA, NS, TXT, PTR, MX and CNAME.

    • A record returns IPV4 address of the domain name
    • AAAA record returns the IPV6 address of the domain Name
    • NS record returns the authoritative nameserver for the domain name
    • -
    • CNAME records are aliases to the domain names. Some domains point to other domain names and resolving the latter domain name gives an IP which is used as an IP for the former domain name as well. Example www.linkedin.com’s IP address is the same as 2-01-2c3e-005a.cdx.cedexis.net.
    • -
    • For the brevity we are not discussing other DNS record types, the RFC of each of these records are available here.
    • +
• CNAME records are aliases to domain names. Some domains point to other domain names, and resolving the latter domain name gives an IP which is used as an IP for the former domain name as well. For example, www.linkedin.com's IP address is the same as that of 2-01-2c3e-005a.cdx.cedexis.net.
    • +
• For brevity, we are not discussing other DNS record types; the RFCs of these records are available here.
    dig A linkedin.com +short
     108.174.10.10
    @@ -2240,16 +2240,16 @@ dns1.p09.nsone.net.
     dig www.linkedin.com CNAME +short
     2-01-2c3e-005a.cdx.cedexis.net.
     
    -

    Armed with these fundamentals of DNS lets see usecases where DNS is used by SREs.

    +

Armed with these fundamentals of DNS, let's see use cases where DNS is used by SREs.

    Applications in SRE role

    -

    This section covers some of the common solutions SRE can derive from DNS

    +

This section covers some of the common solutions an SRE can derive from DNS.

      -
    1. Every company has to have its internal DNS infrastructure for intranet sites and internal services like databases and other internal applications like wiki. So there has to be a DNS infrastructure maintained for those domain names by the infrastructure team. This DNS infrastructure has to be optimized and scaled so that it doesn’t become a single point of failure. Failure of the internal DNS infrastructure can cause API calls of microservices to fail and other cascading effects.
    2. -
    3. DNS can also be used for discovering services. For example the hostname serviceb.internal.example.com could list instances which run service b internally in example.com company. Cloud providers provide options to enable DNS discovery(example)
    4. -
    5. DNS is used by cloud providers and CDN providers to scale their services. In Azure/AWS, Load Balancers are given a CNAME instead of IPAddress. They update the IPAddress of the Loadbalancers as they scale by changing the IP Address of alias domain names. This is one of the reasons why A records of such alias domains are short lived like 1 minute.
    6. +
7. Every company has to have its internal DNS infrastructure for intranet sites and internal services like databases and other internal applications like a wiki. So there has to be a DNS infrastructure maintained for those domain names by the infrastructure team. This DNS infrastructure has to be optimized and scaled so that it doesn't become a single point of failure. Failure of the internal DNS infrastructure can cause API calls of microservices to fail and other cascading effects.
    8. +
9. DNS can also be used for discovering services. For example, the hostname serviceb.internal.example.com could list instances which run service b internally in the example.com company. Cloud providers provide options to enable DNS discovery (example). A small sketch of such a lookup follows this list.
    10. +
11. DNS is used by cloud providers and CDN providers to scale their services. In Azure/AWS, load balancers are given a CNAME instead of an IP address. They update the IP addresses of the load balancers as they scale by changing the IP address of the alias domain names. This is one of the reasons why A records of such alias domains are short-lived, like 1 minute.
12. DNS can also be used to give clients IP addresses closer to their location so that their HTTP calls can be responded to faster if the company has a geographically distributed presence.
    13. -
    14. SRE also has to understand since there is no verification in DNS infrastructure, these responses can be spoofed. This is safeguarded by other protocols like HTTPS(dealt later). DNSSEC protects from forged or manipulated DNS responses.
    15. -
    16. Stale DNS cache can be a problem. Some apps might still be using expired DNS records for their api calls. This is something SRE has to be wary of when doing maintenance.
    17. +
18. SREs also have to understand that since there is no verification in DNS infrastructure, these responses can be spoofed. This is safeguarded by other protocols like HTTPS (dealt with later). DNSSEC protects from forged or manipulated DNS responses.
    19. +
20. Stale DNS caches can be a problem. Some apps might still be using expired DNS records for their API calls. This is something an SRE has to be wary of when doing maintenance.
21. DNS load balancing and service discovery also have to account for TTL: servers can be removed from the pool only after waiting out the TTL once the changes are made to DNS records. If this is not done, a certain portion of the traffic will fail as the server is removed before the TTL expires.
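To make the service discovery point above concrete, here is a minimal Python sketch; the hostname serviceb.internal.example.com and port 8080 are the illustrative values from that point, and getaddrinfo is simply one way to ask the system resolver for every address the name maps to.

# A sketch, not a production discovery client; assumes the service name
# publishes one A record per instance
import socket

def discover(hostname, port):
    # getaddrinfo goes through the resolvers in /etc/resolv.conf and returns
    # one entry per (address, socket type) pair; keep the unique IPs
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

# Works only where this internal name actually resolves
print(discover("serviceb.internal.example.com", 8080))

Since this goes through the system resolver, the TTL caveat in the last point applies to the addresses it returns.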
    diff --git a/level101/linux_networking/http/index.html b/level101/linux_networking/http/index.html index 2d16c3d..b26fc55 100644 --- a/level101/linux_networking/http/index.html +++ b/level101/linux_networking/http/index.html @@ -2107,9 +2107,9 @@

    HTTP

    -

    Till this point we have only got the IP address of linkedin.com. The HTML page of linkedin.com is served by HTTP protocol which the browser renders. Browser sends a HTTP request to the IP of the server determined above. -Request has a verb GET, PUT, POST followed by a path and query parameters and lines of key value pair which gives information about the client and capabilities of the client like contents it can accept and a body (usually in POST or PUT)

    -
    # Eg run the following in your container and have a look at the headers 
    +

Till this point, we have only got the IP address of linkedin.com. The HTML page of linkedin.com is served by the HTTP protocol, which the browser renders. The browser sends an HTTP request to the IP of the server determined above. +The request has a verb (GET, PUT, POST) followed by a path and query parameters, and lines of key-value pairs which give information about the client and its capabilities, like the contents it can accept, and a body (usually in POST or PUT).

    +
# E.g., run the following in your container and have a look at the headers 
     curl linkedin.com -v
     
    * Connected to linkedin.com (108.174.10.10) port 80 (#0)
    @@ -2128,7 +2128,7 @@ curl linkedin.com -v
     * Connection #0 to host linkedin.com left intact
     * Closing connection 0
     
    -

    Here, in the first line GET is the verb, / is the path and 1.1 is the HTTP protocol version. Then there are key value pairs which give client capabilities and some details to the server. The server responds back with HTTP version, Status Code and Status message. Status codes 2xx means success, 3xx denotes redirection, 4xx denotes client side errors and 5xx server side errors.

    +

Here, in the first line, GET is the verb, / is the path and 1.1 is the HTTP protocol version. Then there are key-value pairs which give client capabilities and some details to the server. The server responds back with the HTTP version, status code and status message. Status codes 2xx mean success, 3xx denotes redirection, 4xx denotes client-side errors and 5xx server-side errors.

    We will now jump in to see the difference between HTTP/1.0 and HTTP/1.1.

    #On the terminal type
     telnet  www.linkedin.com 80
    @@ -2138,10 +2138,10 @@ HOST:linkedin.com
     USER-AGENT: curl
     
     
    -

    This would get server response and waits for next input as the underlying connection to www.linkedin.com can be reused for further queries. While going through TCP, we can understand the benefits of this. But in HTTP/1.0 this connection will be immediately closed after the response meaning new connection has to be opened for each query. HTTP/1.1 can have only one inflight request in an open connection but connection can be reused for multiple requests one after another. One of the benefits of HTTP/2.0 over HTTP/1.1 is we can have multiple inflight requests on the same connection. We are restricting our scope to generic HTTP and not jumping to the intricacies of each protocol version but they should be straight forward to understand post the course.

    -

    HTTP is called stateless protocol. This section we will try to understand what stateless means. Say we logged in to linkedin.com, each request to linkedin.com from the client will have no context of the user and it makes no sense to prompt user to login for each page/resource. This problem of HTTP is solved by COOKIE. A user is created a session when a user logs in. This session identifier is sent to the browser via SET-COOKIE header. The browser stores the COOKIE till the expiry set by the server and sends the cookie for each request from hereon for linkedin.com. More details on cookies are available here. Cookies are a critical piece of information like password and since HTTP is a plain text protocol, any man in the middle can capture either password or cookies and can breach the privacy of the user. Similarly as discussed during DNS a spoofed IP of linkedin.com can cause a phishing attack on users where an user can give linkedin’s password to login on the malicious site. To solve both problems HTTPs came in place and HTTPs has to be mandated.

    -

    HTTPS has to provide server identification and encryption of data between client and server. The server administrator has to generate a private public key pair and certificate request. This certificate request has to be signed by a certificate authority which converts the certificate request to a certificate. The server administrator has to update the certificate and private key to the webserver. The certificate has details about the server (like domain name for which it serves, expiry date), public key of the server. The private key is a secret to the server and losing the private key loses the trust the server provides. When clients connect, the client sends a HELLO. The server sends its certificate to the client. The client checks the validity of the cert by seeing if it is within its expiry time, if it is signed by a trusted authority and the hostname in the cert is the same as the server. This validation makes sure the server is the right server and there is no phishing. Once that is validated, the client negotiates a symmetrical key and cipher with the server by encrypting the negotiation with the public key of the server. Nobody else other than the server who has the private key can understand this data. Once negotiation is complete, that symmetric key and algorithm is used for further encryption which can be decrypted only by client and server from thereon as they only know the symmetric key and algorithm. The switch to symmetric algorithm from asymmetric encryption algorithm is to not strain the resources of client devices as symmetric encryption is generally less resource intensive than asymmetric.

    -
    #Try the following on your terminal to see the cert details like Subject Name(domain name), Issuer details, Expiry date
    +

This would get the server response and wait for the next input, as the underlying connection to www.linkedin.com can be reused for further queries. While going through TCP, we can understand the benefits of this. But in HTTP/1.0, this connection will be immediately closed after the response, meaning a new connection has to be opened for each query. HTTP/1.1 can have only one inflight request on an open connection, but the connection can be reused for multiple requests one after another. One of the benefits of HTTP/2.0 over HTTP/1.1 is that we can have multiple inflight requests on the same connection. We are restricting our scope to generic HTTP and not jumping into the intricacies of each protocol version, but they should be straightforward to understand post the course.

    +

HTTP is called a stateless protocol. In this section, we will try to understand what stateless means. Say we logged in to linkedin.com; each request to linkedin.com from the client will have no context of the user, and it makes no sense to prompt the user to log in for each page/resource. This problem of HTTP is solved by the COOKIE. A session is created for a user when the user logs in. This session identifier is sent to the browser via the SET-COOKIE header. The browser stores the COOKIE till the expiry set by the server and sends the cookie for each request to linkedin.com from here on. More details on cookies are available here. Cookies are a critical piece of information, like a password, and since HTTP is a plain text protocol, any man-in-the-middle can capture either passwords or cookies and breach the privacy of the user. Similarly, as discussed during DNS, a spoofed IP of linkedin.com can cause a phishing attack on users, where a user can give LinkedIn's password to log in on the malicious site. To solve both problems, HTTPS came into place and HTTPS has to be mandated.
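As a quick illustration of this cookie flow, curl can play the role of the browser here: -c stores the cookies a server sets and -b sends them back on the next request. The file path is illustrative, and which cookies you see depends on the server.

# Store cookies set by the server, then replay them on a second request
curl -sv -c /tmp/cookies.txt https://www.linkedin.com -o /dev/null
curl -sv -b /tmp/cookies.txt https://www.linkedin.com -o /dev/null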

    +

HTTPS has to provide server identification and encryption of data between client and server. The server administrator has to generate a private-public key pair and a certificate request. This certificate request has to be signed by a certificate authority, which converts the certificate request into a certificate. The server administrator has to install the certificate and private key on the webserver. The certificate has details about the server (like the domain name it serves, the expiry date) and the public key of the server. The private key is a secret to the server, and losing the private key loses the trust the server provides. When clients connect, the client sends a HELLO. The server sends its certificate to the client. The client checks the validity of the cert by seeing if it is within its expiry time, if it is signed by a trusted authority, and if the hostname in the cert is the same as the server. This validation makes sure the server is the right server and there is no phishing. Once that is validated, the client negotiates a symmetric key and cipher with the server by encrypting the negotiation with the public key of the server. Nobody other than the server, which has the private key, can understand this data. Once negotiation is complete, that symmetric key and algorithm is used for further encryption, which can be decrypted only by client and server from thereon, as only they know the symmetric key and algorithm. The switch from the asymmetric encryption algorithm to a symmetric algorithm is to not strain the resources of client devices, as symmetric encryption is generally less resource-intensive than asymmetric.

    +
    # Try the following on your terminal to see the cert details like Subject Name (domain name), Issuer details, Expiry date
     curl https://www.linkedin.com -v 
     
    * Connected to www.linkedin.com (13.107.42.14) port 443 (#0)
    @@ -2216,7 +2216,7 @@ date: Mon, 09 Nov 2020 10:50:10 GMT
     
     * Closing connection 0
     
    -

    Here my system has a list of certificate authorities it trusts in this file /etc/ssl/cert.pem. Curl validates the certificate is for www.linkedin.com by seeing the CN section of the subject part of the certificate. It also makes sure the certificate is not expired by seeing the expire date. It also validates the signature on the certificate by using the public key of issuer Digicert in /etc/ssl/cert.pem. Once this is done, using the public key of www.linkedin.com it negotiates cipher TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 with a symmetric key. Subsequent data transfer including first HTTP request uses the same cipher and symmetric key.

    +

Here, my system has a list of certificate authorities it trusts, in the file /etc/ssl/cert.pem. cURL validates that the certificate is for www.linkedin.com by checking the CN section of the subject part of the certificate. It also makes sure the certificate is not expired by checking the expiry date. It also validates the signature on the certificate by using the public key of the issuer Digicert in /etc/ssl/cert.pem. Once this is done, using the public key of www.linkedin.com, it negotiates the cipher TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 with a symmetric key. Subsequent data transfer, including the first HTTP request, uses the same cipher and symmetric key.
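openssl (assuming it is installed) gives another view of the same validation steps; the first command prints the certificate chain and the negotiated cipher, and the second narrows the output to the subject, issuer and validity dates of the leaf certificate.

# Show the certificate chain and the agreed cipher
openssl s_client -connect www.linkedin.com:443 -servername www.linkedin.com < /dev/null
# Print just the subject, issuer and validity period of the leaf certificate
openssl s_client -connect www.linkedin.com:443 -servername www.linkedin.com < /dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -dates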

    diff --git a/level101/linux_networking/intro/index.html b/level101/linux_networking/intro/index.html index 9d2bc1b..e371a3e 100644 --- a/level101/linux_networking/intro/index.html +++ b/level101/linux_networking/intro/index.html @@ -2209,15 +2209,15 @@

    Linux Networking Fundamentals

    Prerequisites

      -
    • High-level knowledge of commonly used jargon in TCP/IP stack like DNS, TCP, UDP and HTTP
    • +
    • High-level knowledge of commonly used jargon in TCP/IP stack like DNS, TCP, UDP and HTTP
    • Linux Commandline Basics

    What to expect from this course

    Throughout the course, we cover how an SRE can optimize the system to improve their web stack performance and troubleshoot if there is an issue in any of the layers of the networking stack. This course tries to dig through each layer of traditional TCP/IP stack and expects an SRE to have a picture beyond the bird’s eye view of the functioning of the Internet.

    What is not covered under this course

    -

    This course spends time on the fundamentals. We are not covering concepts like HTTP/2.0, QUIC, TCP congestion control protocols, Anycast, BGP, CDN, Tunnels and Multicast. We expect that this course will provide the relevant basics to understand such concepts

    +

    This course spends time on the fundamentals. We are not covering concepts like HTTP/2.0, QUIC, TCP congestion control protocols, Anycast, BGP, CDN, Tunnels and Multicast. We expect that this course will provide the relevant basics to understand such concepts.

Bird's eye view of the course

    -

    The course covers the question “What happens when you open linkedin.com in your browser?” The course follows the flow of TCP/IP stack.More specifically, the course covers topics of Application layer protocols DNS and HTTP, transport layer protocols UDP and TCP, networking layer protocol IP and Data Link Layer protocol

    +

The course covers the question "What happens when you open linkedin.com in your browser?" The course follows the flow of the TCP/IP stack. More specifically, the course covers topics of application layer protocols (DNS and HTTP), transport layer protocols (UDP and TCP), the networking layer protocol (IP) and the data link layer protocol.

    Course Contents

    1. DNS
    2. diff --git a/level101/linux_networking/ipr/index.html b/level101/linux_networking/ipr/index.html index f6a4ab7..67d70f2 100644 --- a/level101/linux_networking/ipr/index.html +++ b/level101/linux_networking/ipr/index.html @@ -2151,8 +2151,8 @@

      IP Routing and Data Link Layer

      -

      We will dig how packets that leave the client reach the server and vice versa. When the packet reaches the IP layer, the transport layer populates source port, destination port. IP/Network layer populates destination IP(discovered from DNS) and then looks up the route to the destination IP on the routing table.

      -
      #Linux route -n command gives the default routing table
      +

We will dig into how packets that leave the client reach the server and vice versa. By the time the packet reaches the IP layer, the transport layer has populated the source port and destination port. The IP/network layer populates the destination IP (discovered from DNS) and then looks up the route to the destination IP in the routing table.

      +
      # Linux `route -n` command gives the default routing table
       route -n
       
      Kernel IP routing table
      @@ -2160,18 +2160,18 @@ Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
       0.0.0.0         172.17.0.1      0.0.0.0         UG    0      0        0 eth0
       172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth0
       
      -

      Here the destination IP is bitwise AND’d with the Genmask and if the answer is the destination part of the table then that gateway and interface is picked for routing. Here linkedin.com’s IP 108.174.10.10 is AND’d with 255.255.255.0 and the answer we get is 108.174.10.0 which doesn’t match with any destination in the routing table. Then Linux does an AND of destination IP with 0.0.0.0 and we get 0.0.0.0. This answer matches the default row

      -

      Routing table is processed in the order of more octets of 1 set in genmask and genmask 0.0.0.0 is the default route if nothing matches. -At the end of this operation Linux figured out that the packet has to be sent to next hop 172.17.0.1 via eth0. The source IP of the packet will be set as the IP of interface eth0. -Now to send the packet to 172.17.0.1 linux has to figure out the MAC address of 172.17.0.1. MAC address is figured by looking at the internal arp cache which stores translation between IP address and MAC address. If there is a cache miss, Linux broadcasts ARP request within the internal network asking who has 172.17.0.1. The owner of the IP sends an ARP response which is cached by the kernel and the kernel sends the packet to the gateway by setting Source mac address as mac address of eth0 and destination mac address of 172.17.0.1 which we got just now. Similar routing lookup process is followed in each hop till the packet reaches the actual server. Transport layer and layers above it come to play only at end servers. During intermediate hops only till the IP/Network layer is involved.

      +

Here, the destination IP is bitwise AND'd with the Genmask, and if the result matches the Destination entry of a row, then that row's gateway and interface are picked for routing. Here, linkedin.com's IP 108.174.10.10 is AND'd with 255.255.0.0 and the answer we get is 108.174.0.0, which doesn't match any destination in the routing table. Then, Linux does an AND of the destination IP with 0.0.0.0 and we get 0.0.0.0. This answer matches the default row.
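Below is a small Python sketch of this lookup using the two rows of the routing table above. It illustrates the AND-and-compare algorithm only and is not how the kernel actually implements it; ipaddress is from the standard library.

import ipaddress

# (Destination, Genmask, Gateway, Iface) rows from the `route -n` output above
routes = [
    ("0.0.0.0", "0.0.0.0", "172.17.0.1", "eth0"),
    ("172.17.0.0", "255.255.0.0", "0.0.0.0", "eth0"),
]

def lookup(dst):
    dst_int = int(ipaddress.ip_address(dst))
    best = None
    for dest, mask, gateway, iface in routes:
        mask_int = int(ipaddress.ip_address(mask))
        # AND the destination IP with the Genmask, compare with Destination
        if dst_int & mask_int == int(ipaddress.ip_address(dest)):
            # prefer the most specific (longest) matching Genmask
            if best is None or mask_int > best[0]:
                best = (mask_int, gateway, iface)
    return best[1:]

print(lookup("108.174.10.10"))  # ('172.17.0.1', 'eth0'): via the default gateway
print(lookup("172.17.42.5"))    # ('0.0.0.0', 'eth0'): same network, no L3 hop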

      +

The routing table is processed in the order of the most 1 bits set in the Genmask, and Genmask 0.0.0.0 is the default route if nothing else matches. +At the end of this operation, Linux figured out that the packet has to be sent to the next hop 172.17.0.1 via eth0. The source IP of the packet will be set as the IP of interface eth0. +Now, to send the packet to 172.17.0.1, Linux has to figure out the MAC address of 172.17.0.1. The MAC address is found by looking at the internal ARP cache, which stores the translation between IP address and MAC address. If there is a cache miss, Linux broadcasts an ARP request within the internal network asking who has 172.17.0.1. The owner of the IP sends an ARP response, which is cached by the kernel, and the kernel sends the packet to the gateway by setting the source MAC address to the MAC address of eth0 and the destination MAC address to that of 172.17.0.1 which we got just now. A similar routing lookup process is followed at each hop till the packet reaches the actual server. The transport layer and layers above it come into play only at the end servers. At intermediate hops, only layers up to the IP/network layer are involved.

      Screengrab for above explanation
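You can inspect the ARP cache the kernel consults before framing the packet:

# Show the kernel's ARP cache; `arp -n` is the older equivalent
ip neigh show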

      -

      One weird gateway we saw in the routing table is 0.0.0.0. This gateway means no Layer3(Network layer) hop is needed to send the packet. Both source and destination are in the same network. Kernel has to figure out the mac of the destination and populate source and destination mac appropriately and send the packet out so that it reaches the destination without any Layer3 hop in the middle

      -

      As we followed in other modules, lets complete this session with SRE usecases

      +

One weird gateway we saw in the routing table is 0.0.0.0. This gateway means no Layer3 (network layer) hop is needed to send the packet. Both source and destination are in the same network. The kernel has to figure out the MAC of the destination, populate the source and destination MACs appropriately, and send the packet out so that it reaches the destination without any Layer3 hop in the middle.

      +

      As we followed in other modules, let's complete this session with SRE use cases.

      Applications in SRE role

        -
      1. Generally the routing table is populated by DHCP and playing around is not a good practice. There can be reasons where one has to play around the routing table but take that path only when it's absolutely necessary
      2. -
      3. Understanding error messages better like, “No route to host” error can mean mac address of the destination host is not found and it can mean the destination host is down
      4. -
      5. On rare cases looking at the ARP table can help us understand if there is a IP conflict where same IP is assigned to two hosts by mistake and this is causing unexpected behavior
      6. +
7. Generally, the routing table is populated by DHCP, and playing around with it is not a good practice. There can be reasons where one has to play around with the routing table, but take that path only when it's absolutely necessary (see the commands after this list).
      8. +
9. Understanding error messages better: for example, a "No route to host" error can mean the MAC address of the destination host was not found, which can mean the destination host is down.
      10. +
11. In rare cases, looking at the ARP table can help us understand if there is an IP conflict, where the same IP is assigned to two hosts by mistake and this is causing unexpected behavior.
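A few commands that go with the points above; the route added in the last line is purely illustrative and, as the first point notes, should be attempted only when absolutely necessary.

# Inspect the routing table and the ARP table
ip route show
ip neigh show
# Add a static route (illustrative values; avoid unless necessary)
sudo ip route add 10.0.0.0/24 via 172.17.0.1 dev eth0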
      diff --git a/level101/linux_networking/tcp/index.html b/level101/linux_networking/tcp/index.html index 2e3051d..adf0e3d 100644 --- a/level101/linux_networking/tcp/index.html +++ b/level101/linux_networking/tcp/index.html @@ -2152,27 +2152,27 @@

      TCP

      TCP is a transport layer protocol like UDP but it guarantees reliability, flow control and congestion control. -TCP guarantees reliable delivery by using sequence numbers. A TCP connection is established by a three way handshake. In our case, the client sends a SYN packet along with the starting sequence number it plans to use, the server acknowledges the SYN packet and sends a SYN with its sequence number. Once the client acknowledges the syn packet, the connection is established. Each data transferred from here on is considered delivered reliably once acknowledgement for that sequence is received by the concerned party

+TCP guarantees reliable delivery by using sequence numbers. A TCP connection is established by a three-way handshake. In our case, the client sends a SYN packet along with the starting sequence number it plans to use, and the server acknowledges the SYN packet and sends a SYN with its own sequence number. Once the client acknowledges the server's SYN, the connection is established. Each piece of data transferred from here on is considered delivered reliably once the acknowledgement for that sequence is received by the concerned party.

      3-way handshake

      -
      #To understand handshake run packet capture on one bash session
      +
# To understand the handshake, run a packet capture in one bash session
       tcpdump -S -i any port 80
      -#Run curl on one bash session
# Run curl in another bash session
       curl www.linkedin.com
       

      tcpdump-3way

      -

      Here client sends a syn flag shown by [S] flag with a sequence number 1522264672. The server acknowledges receipt of SYN with an ack [.] flag and a Syn flag for its sequence number[S]. The server uses the sequence number 1063230400 and acknowledges the client it’s expecting sequence number 1522264673 (client sequence+1). Client sends a zero length acknowledgement packet to the server(server sequence+1) and connection stands established. This is called three way handshake. The client sends a 76 bytes length packet after this and increments its sequence number by 76. Server sends a 170 byte response and closes the connection. This was the difference we were talking about between HTTP/1.1 and HTTP/1.0. In HTTP/1.1 this same connection can be reused which reduces overhead of 3 way handshake for each HTTP request. If a packet is missed between client and server, server won’t send an ack to the client and client would retry sending the packet till the ACK is received. This guarantees reliability. -The flow control is established by the win size field in each segment. The win size says available TCP buffer length in the kernel which can be used to buffer received segments. A size 0 means the receiver has a lot of lag to catch from its socket buffer and the sender has to pause sending packets so that receiver can cope up. This flow control protects from slow receiver and fast sender problem

      -

      TCP also does congestion control which determines how many segments can be in transit without an ack. Linux provides us the ability to configure algorithms for congestion control which we are not covering here.

      -

      While closing a connection, client/server calls a close syscall. Let's assume client do that. Client’s kernel will send a FIN packet to the server. Server’s kernel can’t close the connection till the close syscall is called by the server application. Once server app calls close, server also sends a FIN packet and client enters into time wait state for 2*MSS(120s) so that this socket can’t be reused for that time period to prevent any TCP state corruptions due to stray stale packets.

      +

Here, the client sends a SYN, shown by the [S] flag, with sequence number 1522264672. The server acknowledges receipt of the SYN with an ACK [.] flag and a SYN flag [S] for its own sequence number. The server uses the sequence number 1063230400 and tells the client it's expecting sequence number 1522264673 (client sequence + 1). The client sends a zero-length acknowledgement packet to the server (server sequence + 1) and the connection stands established. This is called the three-way handshake. The client sends a 76-byte packet after this and increments its sequence number by 76. The server sends a 170-byte response and closes the connection. This was the difference we were talking about between HTTP/1.1 and HTTP/1.0. In HTTP/1.1, this same connection can be reused, which reduces the overhead of the three-way handshake for each HTTP request. If a packet is missed between client and server, the server won't send an ACK to the client and the client would retry sending the packet till the ACK is received. This guarantees reliability. +The flow control is established by the WIN size field in each segment. The WIN size indicates the available TCP buffer length in the kernel which can be used to buffer received segments. A size of 0 means the receiver has a lot of lag to catch up on from its socket buffer, and the sender has to pause sending packets so that the receiver can cope. This flow control protects from the slow receiver and fast sender problem.

      +

TCP also does congestion control, which determines how many segments can be in transit without an ACK. Linux provides the ability to configure congestion control algorithms, which we are not covering here.

      +

While closing a connection, the client/server calls a close syscall. Let's assume the client does that. The client's kernel will send a FIN packet to the server. The server's kernel can't close the connection till the close syscall is called by the server application. Once the server app calls close, the server also sends a FIN packet, and the client enters the TIME_WAIT state for 2*MSL (120s), where MSL is the Maximum Segment Lifetime, so that this socket can't be reused for that time period, preventing any TCP state corruptions due to stray stale packets.

      Connection tearing
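To see this teardown behavior on a live host, ss can filter sockets by TCP state; for example, counting sockets currently sitting in TIME_WAIT:

# Count TCP sockets in TIME_WAIT
ss -tan state time-wait | wc -l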

      -

      Armed with our TCP and HTTP knowledge lets see how this is used by SREs in their role

      +

      Armed with our TCP and HTTP knowledge, let's see how this is used by SREs in their role.

      Applications in SRE role

1. Scaling HTTP performance using load balancers needs consistent knowledge about both TCP and HTTP. There are different kinds of load balancing like L4 load balancing, L7 load balancing, Direct Server Return, etc. HTTPS offloading can be done on the load balancer or directly on servers based on the performance and compliance needs.
      2. -
      3. Tweaking sysctl variables for rmem and wmem like we did for UDP can improve throughput of sender and receiver.
      4. -
      5. Sysctl variable tcp_max_syn_backlog and socket variable somax_conn determines how many connections for which the kernel can complete 3 way handshake before app calling accept syscall. This is much useful in single threaded applications. Once the backlog is full, new connections stay in SYN_RCVD state (when you run netstat) till the application calls accept syscall
      6. -
      7. Apps can run out of file descriptors if there are too many short lived connections. Digging through tcp_reuse and tcp_recycle can help reduce time spent in the time wait state(it has its own risk). Making apps reuse a pool of connections instead of creating ad hoc connection can also help
      8. -
      9. Understanding performance bottlenecks by seeing metrics and classifying whether its a problem in App or network side. Example too many sockets in Close_wait state is a problem on application whereas retransmissions can be a problem more on network or on OS stack than the application itself. Understanding the fundamentals can help us narrow down where the bottleneck is
      10. +
      11. Tweaking sysctl variables for rmem and wmem like we did for UDP can improve throughput of sender and receiver.
      12. +
13. The sysctl variables tcp_max_syn_backlog and somaxconn determine how many connections the kernel can complete the three-way handshake for before the application calls the accept syscall (see the commands after this list). This is very useful in single-threaded applications. Once the backlog is full, new connections stay in the SYN_RCVD state (when you run netstat) till the application calls the accept syscall.
      14. +
15. Apps can run out of file descriptors if there are too many short-lived connections. Digging through tcp_tw_reuse and tcp_tw_recycle can help reduce the time spent in the TIME_WAIT state (each has its own risks). Making apps reuse a pool of connections instead of creating ad hoc connections can also help.
      16. +
17. Understanding performance bottlenecks by looking at metrics and classifying whether it's a problem on the application or the network side. For example, too many sockets in the CLOSE_WAIT state is a problem on the application, whereas retransmissions can be a problem more on the network or the OS stack than the application itself. Understanding the fundamentals can help us narrow down where the bottleneck is.
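A few commands that go with the backlog and TIME_WAIT points above. The variable names assume a reasonably recent Linux kernel; tcp_tw_recycle was removed in kernel 4.12, so it may not exist on your host.

# Inspect the backlog and TIME_WAIT related variables
sysctl net.ipv4.tcp_max_syn_backlog net.core.somaxconn
sysctl net.ipv4.tcp_tw_reuse
# Count sockets in the states discussed above
ss -tan state syn-recv | wc -l
ss -tan state close-wait | wc -l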
      diff --git a/level101/linux_networking/udp/index.html b/level101/linux_networking/udp/index.html index b176a76..3c9287e 100644 --- a/level101/linux_networking/udp/index.html +++ b/level101/linux_networking/udp/index.html @@ -2151,13 +2151,13 @@

      UDP

      -

      UDP is a transport layer protocol. DNS is an application layer protocol that runs on top of UDP(most of the times). Before jumping into UDP, let's try to understand what an application and transport layer is. DNS protocol is used by a DNS client(eg dig) and DNS server(eg named). The transport layer makes sure the DNS request reaches the DNS server process and similarly the response reaches the DNS client process. Multiple processes can run on a system and they can listen on any ports. DNS servers usually listen on port number 53. When a client makes a DNS request, after filling the necessary application payload, it passes the payload to the kernel via sendto system call. The kernel picks a random port number(>1024) as source port number and puts 53 as destination port number and sends the packet to lower layers. When the kernel on server side receives the packet, it checks the port number and queues the packet to the application buffer of the DNS server process which makes a recvfrom system call and reads the packet. This process by the kernel is called multiplexing(combining packets from multiple applications to same lower layers) and demultiplexing(segregating packets from single lower layer to multiple applications). Multiplexing and Demultiplexing is done by the Transport layer.

      -

      UDP is one of the simplest transport layer protocol and it does only multiplexing and demultiplexing. Another common transport layer protocol TCP does a bunch of other things like reliable communication, flow control and congestion control. UDP is designed to be lightweight and handle communications with little overhead. So it doesn’t do anything beyond multiplexing and demultiplexing. If applications running on top of UDP need any of the features of TCP, they have to implement that in their application

      -

      This example from python wiki covers a sample UDP client and server where “Hello World” is an application payload sent to server listening on port number 5005. The server receives the packet and prints the “Hello World” string from the client

      +

UDP is a transport layer protocol. DNS is an application layer protocol that runs on top of UDP (most of the time). Before jumping into UDP, let's try to understand what an application and a transport layer are. The DNS protocol is used by a DNS client (e.g., dig) and a DNS server (e.g., named). The transport layer makes sure the DNS request reaches the DNS server process and, similarly, that the response reaches the DNS client process. Multiple processes can run on a system and they can listen on any port. DNS servers usually listen on port number 53. When a client makes a DNS request, after filling the necessary application payload, it passes the payload to the kernel via the sendto system call. The kernel picks a random port number (>1024) as the source port number, puts 53 as the destination port number, and sends the packet to the lower layers. When the kernel on the server side receives the packet, it checks the port number and queues the packet to the application buffer of the DNS server process, which makes a recvfrom system call and reads the packet. This process by the kernel is called multiplexing (combining packets from multiple applications to the same lower layers) and demultiplexing (segregating packets from a single lower layer to multiple applications). Multiplexing and demultiplexing are done by the transport layer.

      +

UDP is one of the simplest transport layer protocols and it does only multiplexing and demultiplexing. Another common transport layer protocol, TCP, does a bunch of other things like reliable communication, flow control and congestion control. UDP is designed to be lightweight and handle communications with little overhead. So, it doesn't do anything beyond multiplexing and demultiplexing. If applications running on top of UDP need any of the features of TCP, they have to implement them in their application.

      +

This example from the Python wiki covers a sample UDP client and server, where "Hello World" is an application payload sent to a server listening on port number 5005. The server receives the packet and prints the "Hello World" string from the client.
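A minimal sketch in the spirit of that example follows; it is collapsed into a single process here for brevity, whereas the wiki splits it into separate client and server scripts.

import socket

UDP_IP, UDP_PORT = "127.0.0.1", 5005

# server socket: bound to port 5005, so the kernel demultiplexes to it
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind((UDP_IP, UDP_PORT))

# client socket: the kernel picks a random source port on sendto
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"Hello World", (UDP_IP, UDP_PORT))

data, addr = server.recvfrom(1024)
print(data.decode(), "from", addr)  # Hello World from ('127.0.0.1', <source port>)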

      Applications in SRE role

        -
      1. If the underlying network is slow and the UDP layer is unable to queue packets down to the networking layer, sendto syscall from the application will hang till the kernel finds some of its buffer is freed. This can affect the throughput of the system. Increasing write memory buffer values using sysctl variables net.core.wmem_max and net.core.wmem_default provides some cushion to the application from the slow network
      2. -
      3. Similarly if the receiver process is slow in consuming from its buffer, the kernel has to drop packets which it can’t queue due to the buffer being full. Since UDP doesn’t guarantee reliability these dropped packets can cause data loss unless tracked by the application layer. Increasing sysctl variables rmem_default and rmem_max can provide some cushion to slow applications from fast senders.
      4. +
5. If the underlying network is slow and the UDP layer is unable to queue packets down to the networking layer, the sendto syscall from the application will hang till the kernel finds that some of its buffer is freed. This can affect the throughput of the system. Increasing write memory buffer values using the sysctl variables net.core.wmem_max and net.core.wmem_default provides some cushion to the application from the slow network (see the commands after this list).
      6. +
7. Similarly, if the receiver process is slow in consuming from its buffer, the kernel has to drop packets which it can't queue due to the buffer being full. Since UDP doesn't guarantee reliability, these dropped packets can cause data loss unless tracked by the application layer. Increasing the sysctl variables rmem_default and rmem_max can provide some cushion to slow applications from fast senders.
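To inspect or raise these buffer limits (the value in the last line is illustrative):

sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max
sudo sysctl -w net.core.rmem_max=4194304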
      diff --git a/level101/messagequeue/intro/index.html b/level101/messagequeue/intro/index.html index 0712060..4f615d9 100644 --- a/level101/messagequeue/intro/index.html +++ b/level101/messagequeue/intro/index.html @@ -2234,7 +2234,7 @@

      Messaging services

      What to expect from this course

      -

      At the end of training, you will have an understanding of what a Message Services is, learn about different types of Message Service implementation and understand some of the underlying concepts & trade offs.

      +

At the end of training, you will have an understanding of what a Message Service is, learn about different types of Message Service implementations and understand some of the underlying concepts & trade-offs.

      What is not covered under this course

      We will not be deep diving into any specific Message Service.

      Course Contents

      diff --git a/level101/messagequeue/key_concepts/index.html b/level101/messagequeue/key_concepts/index.html index 3964999..4caec80 100644 --- a/level101/messagequeue/key_concepts/index.html +++ b/level101/messagequeue/key_concepts/index.html @@ -2287,7 +2287,7 @@

      Key Concepts

      -

      Lets looks at some of the key concepts when we talk about messaging system

      +

Let's look at some of the key concepts when we talk about messaging systems.

      Delivery guarantees

      One of the essential aspects of messaging services is ensuring that messages are delivered to their intended recipients. Different systems offer varying levels of delivery guarantees, and it is crucial to understand these guarantees to choose the right messaging service for your needs.

        diff --git a/level101/metrics_and_monitoring/alerts/index.html b/level101/metrics_and_monitoring/alerts/index.html index 541e6b1..86d7357 100644 --- a/level101/metrics_and_monitoring/alerts/index.html +++ b/level101/metrics_and_monitoring/alerts/index.html @@ -2109,11 +2109,11 @@

        Earlier we discussed different ways to collect key metric data points from a service and its underlying infrastructure. This data gives us a better understanding of how the service is performing. One of the main -objectives of monitoring is to detect any service degradations early +objectives of monitoring is to detect any service degradations early (reduce Mean Time To Detect) and notify stakeholders so that the issues are either avoided or can be fixed early, thus reducing Mean Time To Recover (MTTR). For example, if you are notified when resource usage by -a service exceeds 90 percent, you can take preventive measures to avoid +a service exceeds 90%, you can take preventive measures to avoid any service breakdown due to a shortage of resources. On the other hand, when a service goes down due to an issue, early detection and notification of such incidents can help you quickly fix the issue.

        @@ -2124,12 +2124,12 @@ notification of such incidents can help you quickly fix the issue.

        set up alerts on one or a combination of metrics to actively monitor the service health. These alerts have a set of defined rules or conditions, and when the rule is broken, you are notified. These rules can be as -simple as notifying when the metric value exceeds n to as complex as a -week over week (WoW) comparison of standard deviation over a period of +simple as notifying when the metric value exceeds n to as complex as a +week-over-week (WoW) comparison of standard deviation over a period of time. Monitoring tools notify you about an active alert, and most of these tools support instant messaging (IM) platforms, SMS, email, or phone calls. Figure 8 shows a sample alert notification received on -Slack for memory usage exceeding 90 percent of total RAM space on the +Slack for memory usage exceeding 90% of total RAM space on the host.

        diff --git a/level101/metrics_and_monitoring/best_practices/index.html b/level101/metrics_and_monitoring/best_practices/index.html index 01a9856..aa5f671 100644 --- a/level101/metrics_and_monitoring/best_practices/index.html +++ b/level101/metrics_and_monitoring/best_practices/index.html @@ -2110,22 +2110,22 @@ practices in mind.

        • -

          Use the right metric type -- Most of the libraries available +

Use the right metric type—Most of the libraries available today offer various metric types. Choose the appropriate metric type for monitoring your system. Following are the types of metrics and their purposes; a small sketch after the list makes their semantics concrete.

          • -

            Gauge -- Gauge is a constant type of metric. After the +

Gauge—Gauge is a constant type of metric. After the metric is initialized, the metric value does not change unless you intentionally update it.

          • -

            Timer -- Timer measures the time taken to complete a +

Timer—Timer measures the time taken to complete a task.

          • -

            Counter -- Counter counts the number of occurrences of a +

Counter—Counter counts the number of occurrences of a particular event.
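Below is a hand-rolled Python sketch of these three types, just to make the semantics concrete; real metric libraries (statsd or Prometheus client libraries, for instance) provide production implementations.

import time

class Gauge:    # holds a value until it is explicitly updated
    def __init__(self):
        self.value = 0
    def set(self, v):
        self.value = v

class Counter:  # only ever counts up
    def __init__(self):
        self.count = 0
    def inc(self):
        self.count += 1

class Timer:    # measures how long a task took
    def __enter__(self):
        self.start = time.time()
        return self
    def __exit__(self, *exc):
        self.elapsed = time.time() - self.start

requests, in_flight = Counter(), Gauge()
with Timer() as t:
    requests.inc()    # one request served
    in_flight.set(1)  # one request currently being processed
print(requests.count, in_flight.value, f"{t.elapsed:.6f}s")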

          @@ -2135,19 +2135,19 @@ practices in mind.

          Types.

          • -

            Avoid over-monitoring -- Monitoring can be a significant - engineering endeavor. Therefore, be sure not to spend too +

            Avoid over-monitoring—Monitoring can be a significant + engineering endeavor. Therefore, be sure not to spend too much time and resources on monitoring services, yet make sure all important metrics are captured.

          • -

            Prevent alert fatigue -- Set alerts for metrics that are +

            Prevent alert fatigue—Set alerts for metrics that are important and actionable. If you receive too many non-critical alerts, you might start ignoring alert notifications over time. As a result, critical alerts might get overlooked.

          • -

            Have a runbook for alerts -- For every alert, make sure you have +

            Have a runbook for alerts—For every alert, make sure you have a document explaining what actions and checks need to be performed when the alert fires. This enables any engineer on the team to handle the alert and take necessary actions, without any help from diff --git a/level101/metrics_and_monitoring/command-line_tools/index.html b/level101/metrics_and_monitoring/command-line_tools/index.html index c0622f1..f3e6680 100644 --- a/level101/metrics_and_monitoring/command-line_tools/index.html +++ b/level101/metrics_and_monitoring/command-line_tools/index.html @@ -2112,28 +2112,28 @@ understand various subsystem statistics (CPU, memory, network, and so on). Let's look at some of the tools that are predominantly used.

            • -

              ps/top-- The process status command (ps) displays information +

              ps/top: The process status command (ps) displays information about all the currently running processes in a Linux system. The - top command is similar to the ps command, but it periodically + top command is similar to the ps command, but it periodically updates the information displayed until the program is terminated. - An advanced version of top, called htop, has a more user-friendly + An advanced version of top, called htop, has a more user-friendly interface and some additional features. These command-line utilities come with options to modify the operation and output of the command. Following are some important options supported by the - ps command.

              + ps command.

              • -

                -p <pid1, pid2,...> -- Displays information about processes +

                -p <pid1, pid2,...>: Displays information about processes that match the specified process IDs. Similarly, you can use -u <uid> and -g <gid> to display information about processes belonging to a specific user or group.

              • -

                -a -- Displays information about other users' processes, as well +

                -a: Displays information about other users' processes, as well as one's own.

              • -

                -x -- When displaying processes matched by other options, +

                -x: When displaying processes matched by other options, includes processes that do not have a controlling terminal.
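A couple of quick usage examples of the ps options above (output columns vary by platform):

# Show the process with PID 1
ps -p 1
# BSD-style invocation commonly used on Linux: all processes, full details
ps aux | head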

              @@ -2145,21 +2145,21 @@ on). Let's look at some of the tools that are predominantly used.

              • -

                ss -- The socket statistics command (ss) displays information +

                ss: The socket statistics command (ss) displays information about network sockets on the system. This tool is the successor of netstat, which is deprecated. Following are some command-line options - supported by the ss command:

                + supported by the ss command:

                • -

                  -t -- Displays the TCP socket. Similarly, -u displays UDP +

                  -t: Displays the TCP socket. Similarly, -u displays UDP sockets, -x is for UNIX domain sockets, and so on.

                • -

                  -l -- Displays only listening sockets.

                  +

                  -l: Displays only listening sockets.

                • -

                  -n -- Instructs the command to not resolve service names. +

                  -n: Instructs the command to not resolve service names. Instead displays the port numbers.

                @@ -2168,7 +2168,7 @@ on). Let's look at some of the tools that are predominantly used.

                List of listening sockets on a system

                Figure 3: List of listening sockets on a system

                  -
                • free -- The free command displays memory usage statistics on the +
                • free: The free command displays memory usage statistics on the host like available memory, used memory, and free memory. Most often, this command is used with the -h command-line option, which displays the statistics in a human-readable format.
                • @@ -2177,7 +2177,7 @@ on). Let's look at some of the tools that are predominantly used.

                  Figure 4: Memory statistics on a host in human-readable form

                    -
                  • df -- The df command displays disk space usage statistics. The +
                  • df: The df command displays disk space usage statistics. The -i command-line option is also often used to display inode usage statistics. The -h command-line option is used for displaying @@ -2189,13 +2189,13 @@ on). Let's look at some of the tools that are predominantly used.

                    • -

                      sar -- The sar utility monitors various subsystems, such as CPU +

                      sar: The sar utility monitors various subsystems, such as CPU and memory, in real time. This data can be stored in a file specified with the -o option. This tool helps to identify anomalies.

                    • -

                      iftop -- The interface top command (iftop) displays bandwidth +

                      iftop: The interface top command (iftop) displays bandwidth utilization by a host on an interface. This command is often used to identify bandwidth usage by active connections. The -i option specifies which network interface to watch.

                      @@ -2209,31 +2209,31 @@ active connection on the host

                      • -

                        tcpdump -- The tcpdump command is a network monitoring tool that +

                        tcpdump: The tcpdump command is a network monitoring tool that captures network packets flowing over the network and displays a description of the captured packets. The following options are available:

                        • -

                          -i <interface> -- Interface to listen on

                          +

                          -i <interface>: Interface to listen on

                        • -

                          host <IP/hostname> -- Filters traffic going to or from the +

                          host <IP/hostname>: Filters traffic going to or from the specified host

                        • -

                          src/dst -- Displays one-way traffic from the source (src) or to +

                          src/dst: Displays one-way traffic from the source (src) or to the destination (dst)

                        • -

                          port <port number> -- Filters traffic to or from a particular +

                          port <port number>: Filters traffic to or from a particular port

                      tcpdump of packets on an interface

                      -

                      Figure 7: *tcpdump* of packets on *docker0* +

                      Figure 7: tcpdump of packets on docker0 interface on a host

                      diff --git a/level101/metrics_and_monitoring/conclusion/index.html b/level101/metrics_and_monitoring/conclusion/index.html index c3fb268..4e3051a 100644 --- a/level101/metrics_and_monitoring/conclusion/index.html +++ b/level101/metrics_and_monitoring/conclusion/index.html @@ -2151,12 +2151,12 @@

                      Conclusion

                      A robust monitoring and alerting system is necessary for maintaining and troubleshooting a system. A dashboard with key metrics can give you an -overview of service performance, all in one place. Well-defined alerts +overview of service performance, all in one place. Well-defined alerts (with realistic thresholds and notifications) further enable you to quickly identify any anomalies in the service infrastructure and in resource saturation. By taking necessary actions, you can avoid any service degradations and decrease MTTD for service breakdowns.

                      -

                      In addition to in-house monitoring, monitoring real user experience can +

                      In addition to in-house monitoring, monitoring real-user experience can help you to understand service performance as perceived by the users. Many modules are involved in serving the user, and most of them are out of your control. Therefore, you need to have real-user monitoring in diff --git a/level101/metrics_and_monitoring/introduction/index.html b/level101/metrics_and_monitoring/introduction/index.html index ad01f3f..f1fbafe 100644 --- a/level101/metrics_and_monitoring/introduction/index.html +++ b/level101/metrics_and_monitoring/introduction/index.html @@ -2208,7 +2208,7 @@ a system, analyzing the data to derive meaningful information, and displaying the data to the users. In simple terms, you measure various metrics regularly to understand the state of the system, including but not limited to, user requests, latency, and error rate. What gets -measured, gets fixed---if you can measure something, you can reason +measured, gets fixed—if you can measure something, you can reason about it, understand it, discuss it, and act upon it with confidence.

                      Four golden signals of monitoring

                      When setting up monitoring for a system, you need to decide what to @@ -2237,7 +2237,7 @@ if you can measure only four metrics of your service, focus on these four. Let's look at each of the four golden signals.

                      • -

                        Traffic -- Traffic gives a better understanding of the service +

                        TrafficTraffic gives a better understanding of the service demand. Often referred to as service QPS (queries per second), traffic is a measure of requests served by the service. This signal helps you to decide when a service needs to be scaled up to @@ -2245,7 +2245,7 @@ four. Let's look at each of the four golden signals.

                        cost-effective.

                      • -

                        Latency -- Latency is the measure of time taken by the service +

                        LatencyLatency is the measure of time taken by the service to process the incoming request and send the response. Measuring service latency helps in the early detection of slow degradation of the service. Distinguishing between the latency of successful @@ -2258,7 +2258,7 @@ four. Let's look at each of the four golden signals.

                        overall latency might result in misleading calculations.

                      • -

                        Error (rate) -- Error is the measure of failed client +

                        Error (rate)Error is the measure of failed client requests. These failures can be easily identified based on the response codes (HTTP 5XX error). @@ -2274,7 +2274,7 @@ four. Let's look at each of the four golden signals.

                        in place to capture errors in addition to the response codes.

                      • -

                        Saturation -- Saturation is a measure of the resource +

                        SaturationSaturation is a measure of the resource utilization by a service. This signal tells you the state of service resources and how full they are. These resources include memory, compute, network I/O, and so on. Service performance @@ -2306,19 +2306,19 @@ can build intelligent applications to address specific needs. Some of the key use cases follow:

                        • -

                          Reduction in time to resolve issues -- With a good monitoring +

                          Reduction in time to resolve issues—With a good monitoring infrastructure in place, you can identify issues quickly and resolve them, which reduces the impact caused by the issues.

                        • -

                          Business decisions -- Data collected over a period of time can +

                          Business decisions—Data collected over a period of time can help you make business decisions such as determining the product release cycle, which features to invest in, and geographical areas to focus on. Decisions based on long-term data can improve the overall product experience.

                        • -

                          Resource planning -- By analyzing historical data, you can +

                          Resource planning—By analyzing historical data, you can forecast service compute-resource demands, and you can properly allocate resources. This allows financially effective decisions, with no compromise in end-user experience.

                          @@ -2328,44 +2328,44 @@ the key use cases follow:

                          terminologies.

                          • -

                            Metric -- A metric is a quantitative measure of a particular - system attribute---for example, memory or CPU

                            +

                            Metric—A metric is a quantitative measure of a particular + system attribute—for example, memory or CPU

                          • -

                            Node or host -- A physical server, virtual machine, or container +

                            Node or host—A physical server, virtual machine, or container where an application is running

                          • -

                            QPS -- Queries Per Second, a measure of traffic served by the +

                            QPSQueries Per Second, a measure of traffic served by the service per second

                          • -

                            Latency -- The time interval between user action and the - response from the server---for example, time spent after sending a +

                            Latency—The time interval between user action and the + response from the server—for example, time spent after sending a query to a database before the first response bit is received

                          • -

                            Error rate -- Number of errors observed over a particular +

                            Error rate—Number of errors observed over a particular time period (usually a second)

                          • -

                            Graph -- In monitoring, a graph is a representation of one or +

                            Graph—In monitoring, a graph is a representation of one or more values of metrics collected over time

                          • -

                            Dashboard -- A dashboard is a collection of graphs that provide +

                            Dashboard—A dashboard is a collection of graphs that provide an overview of system health

                          • -

                            Incident -- An incident is an event that disrupts the normal +

                            Incident—An incident is an event that disrupts the normal operations of a system

                          • -

                            MTTD -- Mean Time To Detect is the time interval between the +

                            MTTDMean Time To Detect is the time interval between the beginning of a service failure and the detection of such failure

                          • -

                            MTTR -- Mean Time To Resolve is the time spent to fix a service +

                            MTTR—Mean Time To Resolve is the time spent to fix a service failure and bring the service back to its normal state
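
To make the last two terms concrete, here is a minimal sketch (the incident timestamps are hypothetical, and a real MTTD/MTTR calculation would average these intervals over many incidents):

from datetime import datetime

# hypothetical single incident
failure_start    = datetime(2021, 3, 1, 10, 0)   # failure begins
failure_detected = datetime(2021, 3, 1, 10, 12)  # alert fires
failure_resolved = datetime(2021, 3, 1, 10, 47)  # normal state restored

ttd = failure_detected - failure_start  # time to detect: 0:12:00
ttr = failure_resolved - failure_start  # time to resolve: 0:47:00
print(ttd, ttr)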

                          @@ -2382,7 +2382,7 @@ notifying concerned parties during any abnormal behavior. Let's look at each of these infrastructure components:

                          • -

                            Host metrics agent -- A host metrics agent is a process +

                            Host metrics agent—A host metrics agent is a process running on the host that collects performance statistics for host subsystems such as memory, CPU, and network. These metrics are regularly relayed to a metrics collector for storage and @@ -2392,7 +2392,7 @@ each of these infrastructure components:

                            and metricbeat.

                          • -

                            Metric aggregator -- A metric aggregator is a process running +

                            Metric aggregator—A metric aggregator is a process running on the host. Applications running on the host collect service metrics using instrumentation. @@ -2403,7 +2403,7 @@ each of these infrastructure components:

                            StatsD.

                          • -

                            Metrics collector -- A metrics collector process collects all +

Metrics collector—A metrics collector process collects all the metrics from the metric aggregators running on multiple hosts. The collector takes care of decoding and storing this data in the database. Metric collection and storage might be taken care of by @@ -2413,13 +2413,13 @@ each of these infrastructure components:

                            daemons.

                          • -

                            Storage -- A time-series database stores all of these metrics. +

                            Storage—A time-series database stores all of these metrics. Examples are OpenTSDB, Whisper, and InfluxDB.

                          • -

                            Metrics server -- A metrics server can be as basic as a web +

                            Metrics server—A metrics server can be as basic as a web server that graphically renders metric data. In addition, the metrics server provides aggregation functionalities and APIs for fetching metric data programmatically. Some examples are @@ -2427,7 +2427,7 @@ each of these infrastructure components:

                            Graphite-Web.

                          • -

                            Alert manager -- The alert manager regularly polls metric data +

Alert manager—The alert manager regularly polls the available metric data and, if any anomalies are detected, notifies you. Each alert has a set of rules for identifying such anomalies. Today many metrics servers such as diff --git a/level101/metrics_and_monitoring/observability/index.html b/level101/metrics_and_monitoring/observability/index.html index cdfcf77..c5dd6aa 100644 --- a/level101/metrics_and_monitoring/observability/index.html +++ b/level101/metrics_and_monitoring/observability/index.html @@ -2107,7 +2107,7 @@

                            Observability

Engineers often use observability when referring to building reliable -systems. Observability is a term derived from control theory, It is a +systems. Observability is a term derived from control theory; it is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. Service infrastructures used on a daily basis are becoming more and more complex; proactive monitoring @@ -2177,7 +2177,7 @@ value and take further action.

                            Logstash, Kibana), which provides centralized log processing. Beats is a collection of lightweight data shippers that can ship logs, audit data, network data, and so on over the network. In this use case specifically, -we are using filebeat as a log shipper. Filebeat watches service log +we are using Filebeat as a log shipper. Filebeat watches service log files and ships the log data to Logstash. Logstash parses these logs and transforms the data, preparing it to store on Elasticsearch. Transformed log data is stored on Elasticsearch and indexed for fast retrieval. diff --git a/level101/metrics_and_monitoring/third-party_monitoring/index.html b/level101/metrics_and_monitoring/third-party_monitoring/index.html index bc111bb..3bc0b0b 100644 --- a/level101/metrics_and_monitoring/third-party_monitoring/index.html +++ b/level101/metrics_and_monitoring/third-party_monitoring/index.html @@ -2111,13 +2111,13 @@ addition, a number of companies such as Datadog offer monitoring-as-a-service. In this section, we are not covering monitoring-as-a-service in depth.

                            -

                            In recent years, more and more people have access to the internet. Many +

                            In recent years, more and more people have access to the Internet. Many services are offered online to cater to the increasing user base. As a result, web pages are becoming larger, with increased client-side scripts. Users want these services to be fast and error-free. From the service point of view, when the response body is composed, an HTTP 200 OK response is sent, and everything looks okay. But there might be -errors during transmission or on the client side. As previously +errors during transmission or on the client-side. As previously mentioned, monitoring services from within the service infrastructure give good visibility into service health, but this is not enough. You need to monitor user experience, specifically the availability of @@ -2131,7 +2131,7 @@ service is globally accessible. Other third-party monitoring solutions for real user monitoring (RUM) provide performance statistics such as service uptime and response time, from different geographical locations. This allows you to monitor the user experience from these locations, -which might have different internet backbones, different operating +which might have different Internet backbones, different operating systems, and different browsers and browser versions. Catchpoint Global Monitoring Network is a diff --git a/level101/python_web/intro/index.html b/level101/python_web/intro/index.html index 6094c53..7473ddf 100644 --- a/level101/python_web/intro/index.html +++ b/level101/python_web/intro/index.html @@ -2261,17 +2261,17 @@

                            Python and The Web

                            Prerequisites

                              -
                            • Basic understanding of python language.
                            • -
                            • Basic familiarity with flask framework.
                            • +
• Basic understanding of the Python language.
                            • +
• Basic familiarity with the Flask framework.

                            What to expect from this course

                            -

                            This course is divided into two high level parts. In the first part, assuming familiarity with python language’s basic operations and syntax usage, we will dive a little deeper into understanding python as a language. We will compare python with other programming languages that you might already know like Java and C. We will also explore concepts of Python objects and with help of that, explore python features like decorators.

                            -

                            In the second part which will revolve around the web, and also assume familiarity with the Flask framework, we will start from the socket module and work with HTTP requests. This will demystify how frameworks like flask work internally.

                            -

                            And to introduce SRE flavour to the course, we will design, develop and deploy (in theory) a URL shortening application. We will emphasize parts of the whole process that are more important as an SRE of the said app/service.

                            +

This course is divided into two high-level parts. In the first part, assuming familiarity with the Python language's basic operations and syntax, we will dive a little deeper into understanding Python as a language. We will compare Python with other programming languages that you might already know, like Java and C. We will also explore the concept of Python objects and, with the help of that, Python features like decorators.

                            +

In the second part, which revolves around the web and assumes familiarity with the Flask framework, we will start from the socket module and work with HTTP requests. This will demystify how frameworks like Flask work internally.

                            +

And to introduce an SRE flavour to the course, we will design, develop and deploy (in theory) a URL-shortening application. We will emphasize the parts of the whole process that matter more to an SRE of the said app/service.

                            What is not covered under this course

                            -

                            Extensive knowledge of python internals and advanced python.

                            +

                            Extensive knowledge of Python internals and advanced Python.

                            Lab Environment Setup

                            -

                            Have latest version of python installed

                            +

Have the latest version of Python installed

                            Course Contents

                            1. The Python Language
                                @@ -2284,7 +2284,7 @@
                              1. Flask
                            2. -
                            3. The URL Shortening App
                                +
                              1. The URL-Shortening App
                                1. Design
                                2. Scaling The App
                                3. Monitoring The App
                                4. @@ -2292,11 +2292,11 @@

                                The Python Language

                                -

                                Assuming you know a little bit of C/C++ and Java, let's try to discuss the following questions in context of those two languages and python. You might have heard that C/C++ is a compiled language while python is an interpreted language. Generally, with compiled language we first compile the program and then run the executable while in case of python we run the source code directly like python hello_world.py. While Java, being an interpreted language, still has a separate compilation step and then its run. So what's really the difference?

                                +

Assuming you know a little bit of C/C++ and Java, let's try to discuss the following questions in the context of those two languages and Python. You might have heard that C/C++ is a compiled language while Python is an interpreted language. Generally, with a compiled language we first compile the program and then run the executable, while in the case of Python we run the source code directly, like python hello_world.py. Java, while considered an interpreted language, still has a separate compilation step before it is run. So, what's really the difference?

                                Compiled vs. Interpreted

                                -

                                This might sound a little weird to you: python, in a way is a compiled language! Python has a compiler built-in! It is obvious in the case of java since we compile it using a separate command ie: javac helloWorld.java and it will produce a .class file which we know as a bytecode. Well, python is very similar to that. One difference here is that there is no separate compile command/binary needed to run a python program.

                                -

                                What is the difference then, between java and python? -Well, Java's compiler is more strict and sophisticated. As you might know Java is a statically typed language. So the compiler is written in a way that it can verify types related errors during compile time. While python being a dynamic language, types are not known until a program is run. So in a way, python compiler is dumb (or, less strict). But there indeed is a compile step involved when a python program is run. You might have seen python bytecode files with .pyc extension. Here is how you can see bytecode for a given python program.

                                +

This might sound a little weird to you: Python, in a way, is a compiled language! Python has a compiler built in! It is obvious in the case of Java since we compile it using a separate command, ie: javac helloWorld.java, which produces a .class file that we know as bytecode. Well, Python is very similar to that. One difference here is that there is no separate compile command/binary needed to run a Python program.

                                +

What is the difference then, between Java and Python? +Well, Java's compiler is stricter and more sophisticated. As you might know, Java is a statically typed language, so the compiler is written in a way that it can verify type-related errors at compile time. Python being a dynamic language, types are not known until a program is run. So in a way, the Python compiler is dumb (or, less strict). But there indeed is a compile step involved when a Python program is run. You might have seen Python bytecode files with the .pyc extension. Here is how you can see the bytecode for a given Python program.

                                # Create a Hello World
                                 $ echo "print('hello world')" > hello_world.py
                                 
                                @@ -2313,11 +2313,11 @@ $ python -m dis hello_world.py
                                              8 LOAD_CONST               1 (None)
                                             10 RETURN_VALUE
                                 
                                -

                                Read more about dis module here

                                -

                                Now coming to C/C++, there of course is a compiler. But the output is different than what java/python compiler would produce. Compiling a C program would produce what we also know as machine code. As opposed to bytecode.

                                +

Read more about the dis module here.

                                +

Now coming to C/C++, there of course is a compiler. But the output is different from what the Java/Python compilers would produce. Compiling a C program produces what we know as machine code, as opposed to bytecode.

                                Running The Programs

We know compilation is involved in all 3 languages we are discussing; it's just that the compilers are different in nature and they output different types of content. In the case of C/C++, the output is machine code, which can be directly read by your operating system. When you execute that program, your OS will know exactly how to run it. But this is not the case with bytecode.

                                -

                                Those bytecodes are language specific. Python has its own set of bytecode defined (more in dis module) and so does java. So naturally, your operating system will not know how to run it. To run this bytecode, we have something called Virtual Machines. Ie: The JVM or the Python VM (CPython, Jython). These so called Virtual Machines are the programs which can read the bytecode and run it on a given operating system. Python has multiple VMs available. Cpython is a python VM implemented in C language, similarly Jython is a Java implementation of python VM. At the end of the day, what they should be capable of is to understand python language syntax, be able to compile it to bytecode and be able to run that bytecode. You can implement a python VM in any language! (And people do so, just because it can be done)

                                +

Those bytecodes are language-specific. Python has its own set of bytecodes defined (more in the dis module) and so does Java. So naturally, your operating system will not know how to run them. To run this bytecode, we have something called Virtual Machines, ie: the JVM or the Python VM (CPython, Jython). These so-called Virtual Machines are the programs which can read the bytecode and run it on a given operating system. Python has multiple VMs available. CPython is a Python VM implemented in the C language; similarly, Jython is a Java implementation of the Python VM. At the end of the day, what they should be capable of is understanding Python language syntax, compiling it to bytecode and running that bytecode. You can implement a Python VM in any language! (And people do so, just because it can be done.)

                                                                                              The Operating System
                                 
                                                                                               +------------------------------------+
                                @@ -2351,7 +2351,7 @@ hello_world.c                     OS Specific machinecode     |         A New Pr
                                 

                                Two things to note for above diagram:

                                  -
                                1. Generally, when we run a python program, a python VM process is started which reads the python source code, compiles it to byte code and run it in a single step. Compiling is not a separate step. Shown only for illustration purpose.
                                2. +
3. Generally, when we run a Python program, a Python VM process is started which reads the Python source code, compiles it to bytecode and runs it in a single step. Compiling is not a separate step; it is shown separately only for illustration (see the sketch after these notes).
4. Binaries generated for C-like languages are not exactly run as is. Since there are multiple types of binaries (eg: ELF), there are more complicated steps involved in running a binary, but we will not go into that since all of it is done at the OS level.
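
To make note 1 concrete, here is a minimal sketch using Python's built-in compile() and exec() (the source string is just an example):

# compile source to a code object (bytecode), then let the VM run it
source = "print('hello world')"
code_obj = compile(source, "<string>", "exec")
print(code_obj.co_code)  # the raw bytecode bytes
exec(code_obj)           # the Python VM executes the bytecode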
                                diff --git a/level101/python_web/python-concepts/index.html b/level101/python_web/python-concepts/index.html index 660fe38..61d0d01 100644 --- a/level101/python_web/python-concepts/index.html +++ b/level101/python_web/python-concepts/index.html @@ -2217,10 +2217,10 @@

                                Some Python Concepts

                                -

                                Though you are expected to know python and its syntax at basic level, let us discuss some fundamental concepts that will help you understand the python language better.

                                +

Though you are expected to know Python and its syntax at a basic level, let us discuss some fundamental concepts that will help you understand the Python language better.

                                Everything in Python is an object.

                                -

                                That includes the functions, lists, dicts, classes, modules, a running function (instance of function definition), everything. In the CPython, it would mean there is an underlying struct variable for each object.

                                -

                                In python's current execution context, all the variables are stored in a dict. It'd be a string to object mapping. If you have a function and a float variable defined in the current context, here is how it is handled internally.

                                +

That includes functions, lists, dicts, classes, modules, a running function (an instance of a function definition), everything. In CPython, this means there is an underlying struct variable for each object.

                                +

In Python's current execution context, all the variables are stored in a dict. It is a string-to-object mapping. If you have a function and a float variable defined in the current context, here is how they are handled internally.

                                >>> float_number=42.0
                                 >>> def foo_func():
                                 ...     pass
                                @@ -2242,7 +2242,7 @@
                                 '__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__',
                                 '__subclasshook__']
                                 
                                -

                                While there are a lot of them, let's look at some interesting ones

                                +

                                While there are a lot of them, let's look at some interesting ones.

                                globals

This attribute, as the name suggests, has references to global variables. If you ever need to know which global variables are in the scope of this function, this will tell you. See how the function starts seeing the new variable in globals:

                                >>> hello.__globals__
                                @@ -2254,7 +2254,7 @@
                                 {'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <class '_frozen_importlib.BuiltinImporter'>, '__spec__': None, '__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>, 'hello': <function hello at 0x7fe4e82554c0>, 'GLOBAL': 'g_val'}
                                 

                                code

                                -

                                This is an interesting one! As everything in python is an object, this includes the bytecode too. The compiled python bytecode is a python code object. Which is accessible via __code__ attribute here. A function has an associated code object which carries some interesting information.

                                +

This is an interesting one! As everything in Python is an object, this includes the bytecode too. The compiled Python bytecode is a Python code object, which is accessible via the __code__ attribute here. A function has an associated code object which carries some interesting information.

                                # the file in which function is defined
                                 # stdin here since this is run in an interpreter
                                 >>> hello.__code__.co_filename
                                @@ -2272,9 +2272,9 @@
                                 >>> hello.__code__.co_code
                                 b't\x00d\x01|\x00\x9b\x00d\x02\x9d\x03\x83\x01\x01\x00d\x00S\x00'
                                 
                                -

                                There are more code attributes which you can enlist by >>> dir(hello.__code__)

                                +

There are more code attributes, which you can list with >>> dir(hello.__code__).

                                Decorators

                                -

                                Related to functions, python has another feature called decorators. Let's see how that works, keeping everything is an object in mind.

                                +

Related to functions, Python has another feature called decorators. Let's see how that works, keeping "everything is an object" in mind.

                                Here is a sample decorator:

                                >>> def deco(func):
                                 ...     def inner():
                                @@ -2304,7 +2304,7 @@ after
                                 
                              2. Function hello_world is created
                              3. It is passed to deco function
4. deco creates a new function
                                  -
                                1. This new function is calls hello_world function
                                2. +
                                3. This new function calls hello_world function
4. And does a couple of other things
                              5. @@ -2343,10 +2343,10 @@ after

                                Note how the hello_world name points to a new function object but that new function object knows the reference (ID) of the original function.
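
Putting the steps above together, a minimal self-contained sketch (with names consistent with the example) would be:

def deco(func):
    def inner():
        print("before")  # does a couple of other things
        func()           # calls the original hello_world function
        print("after")
    return inner

@deco                    # equivalent to: hello_world = deco(hello_world)
def hello_world():
    print("hello world")

hello_world()            # prints: before, hello world, after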

                                Some Gotchas

                                  -
                                • While it is very quick to build prototypes in python and there are tons of libraries available, as the codebase complexity increases, type errors become more common and will get hard to deal with. (There are solutions to that problem like type annotations in python. Checkout mypy.)
                                • -
                                • Because python is dynamically typed language, that means all types are determined at runtime. And that makes python run very slow compared to other statically typed languages.
                                • +
• While it is very quick to build prototypes in Python and there are tons of libraries available, as codebase complexity increases, type errors become more common and harder to deal with. (There are solutions to that problem, like type annotations in Python. Check out mypy; see the sketch after this list.)
                                • +
• Because Python is a dynamically typed language, all types are determined at runtime. That makes Python run very slowly compared to statically typed languages.
                                • Python has something called GIL (global interpreter lock) which is a limiting factor for utilizing multiple CPU cores for parallel computation.
                                • -
                                • Some weird things that python does: https://github.com/satwikkansal/wtfpython
                                • +
                                • Some weird things that Python does: https://github.com/satwikkansal/wtfpython.
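
As a small illustration of the first gotcha (the function and values are hypothetical), type annotations let a static checker like mypy catch type errors before runtime:

def greeting(name: str) -> str:
    return "Hello " + name

greeting(42)  # mypy flags the incompatible argument type; without
              # annotations, this would only fail at runtime with a TypeError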
                                diff --git a/level101/python_web/python-web-flask/index.html b/level101/python_web/python-web-flask/index.html index c249b21..4ac53c3 100644 --- a/level101/python_web/python-web-flask/index.html +++ b/level101/python_web/python-web-flask/index.html @@ -2163,10 +2163,10 @@

                                Python, Web and Flask

                                -

                                Back in the old days, websites were simple. They were simple static html contents. A webserver would be listening on a defined port and according to the HTTP request received, it would read files from disk and return them in response. But since then, complexity has evolved and websites are now dynamic. Depending on the request, multiple operations need to be performed like reading from database or calling other API and finally returning some response (HTML data, JSON content etc.)

                                -

                                Since serving web requests is no longer a simple task like reading files from disk and return contents, we need to process each http request, perform some operations programmatically and construct a response.

                                +

Back in the old days, websites were simple: just static HTML content. A web server would listen on a defined port and, according to the HTTP request received, read files from disk and return them in response. But since then, complexity has evolved and websites are now dynamic. Depending on the request, multiple operations need to be performed, like reading from a database or calling other APIs, before finally returning some response (HTML data, JSON content, etc.).

                                +

Since serving web requests is no longer a simple task of reading files from disk and returning their contents, we need to process each HTTP request, perform some operations programmatically and construct a response.

                                Sockets

                                -

                                Though we have frameworks like flask, HTTP is still a protocol that works over TCP protocol. So let us setup a TCP server and send an HTTP request and inspect the request's payload. Note that this is not a tutorial on socket programming but what we are doing here is inspecting HTTP protocol at its ground level and look at what its contents look like. (Ref: Socket Programming in Python (Guide) on RealPython)

                                +

Though we have frameworks like Flask, HTTP is still a protocol that works over the TCP protocol. So, let us set up a TCP server, send an HTTP request and inspect the request's payload. Note that this is not a tutorial on socket programming; what we are doing here is inspecting the HTTP protocol at ground level and looking at what its contents look like. (Ref: Socket Programming in Python (Guide) on RealPython)

                                import socket
                                 
                                 HOST = '127.0.0.1'  # Standard loopback interface address (localhost)
                                @@ -2184,7 +2184,7 @@ with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                                                break
                                            print(data)
                                 
                                -

                                Then we open localhost:65432 in our web browser and following would be the output:

                                +

Then, we open localhost:65432 in our web browser, and the following would be the output:

                                Connected by ('127.0.0.1', 54719)
                                 b'GET / HTTP/1.1\r\nHost: localhost:65432\r\nConnection: keep-alive\r\nDNT: 1\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36 Edg/85.0.564.44\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\nSec-Fetch-Site: none\r\nSec-Fetch-Mode: navigate\r\nSec-Fetch-User: ?1\r\nSec-Fetch-Dest: document\r\nAccept-Encoding: gzip, deflate, br\r\nAccept-Language: en-US,en;q=0.9\r\n\r\n'
                                 
                                @@ -2195,10 +2195,10 @@ HEADERS_SEPARATED_BY_SEPARATOR

So though it's a blob of bytes, knowing the HTTP protocol specification, you can parse that string (ie: split by \r\n) and get meaningful information out of it.
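
As a hedged sketch (the raw bytes below are a trimmed version of the payload above), the parsing could start like this:

# split the raw request into the request line and headers, per the HTTP format
raw = b'GET / HTTP/1.1\r\nHost: localhost:65432\r\nConnection: keep-alive\r\n\r\n'
head = raw.split(b'\r\n\r\n', 1)[0].decode()
request_line, *header_lines = head.split('\r\n')
method, path, version = request_line.split(' ')
headers = dict(line.split(': ', 1) for line in header_lines)
print(method, path, headers['Host'])  # GET / localhost:65432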

                                Flask

Flask, and other such frameworks, do pretty much what we just discussed in the last section (with more sophistication added). They listen on a port on a TCP socket, receive an HTTP request, parse the data according to the protocol format and make it available to you in a convenient manner.

                                -

                                ie: you can access headers in flask by request.headers which is made available to you by splitting above payload by /r/n, as defined in http protocol.

                                -

                                Another example: we register routes in flask by @app.route("/hello"). What flask will do is maintain a registry internally which will map /hello with the function you decorated with. Now whenever a request comes with the /hello route (second component in the first line, split by space), flask calls the registered function and returns whatever the function returned.

                                +

That is, you can access headers in Flask via request.headers, which is made available to you by splitting the above payload by \r\n, as defined in the HTTP protocol.

                                +

Another example: we register routes in Flask by @app.route("/hello"). What Flask will do is maintain a registry internally which maps /hello to the function you decorated. Now, whenever a request comes with the /hello route (the second component of the first line, split by space), Flask calls the registered function and returns whatever the function returned.
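
A hedged, highly simplified sketch of such a registry (this is not Flask's actual implementation, only the principle) could be:

# toy route registry: map a path to the handler function decorated with it
routes = {}

def route(path):
    def register(func):
        routes[path] = func
        return func
    return register

@route("/hello")
def hello():
    return "Hello World!"

# on an incoming request for "/hello", the framework looks up and calls:
print(routes["/hello"]())  # Hello World!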

The same goes for all other web frameworks in other languages too. They all work on similar principles: they understand the HTTP protocol, parse the HTTP request data and give us programmers a nice interface to work with HTTP requests.

                                -

                                Not so much of magic, innit?

                                +

Not so much magic in it, is there?

                                diff --git a/level101/python_web/sre-conclusion/index.html b/level101/python_web/sre-conclusion/index.html index 93e693c..c18d619 100644 --- a/level101/python_web/sre-conclusion/index.html +++ b/level101/python_web/sre-conclusion/index.html @@ -2207,33 +2207,33 @@

                                Conclusion

                                Scaling The App

The design and development are just a part of the journey. We will need to set up continuous integration and continuous delivery pipelines sooner or later. And we have to deploy this app somewhere.

                                -

                                Initially we can start with deploying this app on one virtual machine on any cloud provider. But this is a Single point of failure which is something we never allow as an SRE (or even as an engineer). So an improvement here can be having multiple instances of applications deployed behind a load balancer. This certainly prevents problems of one machine going down.

                                +

Initially, we can start by deploying this app on one virtual machine on any cloud provider. But this is a single point of failure, which is something we never allow as an SRE (or even as an engineer). So an improvement here can be having multiple instances of the application deployed behind a load balancer. This certainly prevents the problem of one machine going down.

Scaling here would mean adding more instances behind the load balancer. But this is scalable only up to a certain point. After that, other bottlenecks in the system will start appearing, ie: the DB will become the bottleneck, or perhaps the load balancer itself. How do you know what the bottleneck is? You need to have observability into each aspect of the application architecture.

                                Only after you have metrics, you will be able to know what is going wrong where. What gets measured, gets fixed!

Get deeper insights into scaling from School of SRE's Scalability module and, after going through it, apply your learnings and takeaways to this app. Think about how we would make this app geographically distributed, highly available and scalable.

                                Monitoring Strategy

                                -

                                Once we have our application deployed. It will be working ok. But not forever. Reliability is in the title of our job and we make systems reliable by making the design in a certain way. But things still will go down. Machines will fail. Disks will behave weirdly. Buggy code will get pushed to production. And all these possible scenarios will make the system less reliable. So what do we do? We monitor!

                                +

Once we have our application deployed, it will be working okay. But not forever. Reliability is in the title of our job, and we make systems reliable by designing them in a certain way. But things will still go down. Machines will fail. Disks will behave weirdly. Buggy code will get pushed to production. And all these possible scenarios will make the system less reliable. So what do we do? We monitor!

                                We keep an eye on the system's health and if anything is not going as expected, we want ourselves to get alerted.

                                -

                                Now let's think in terms of the given url shortening app. We need to monitor it. And we would want to get notified in case something goes wrong. But we first need to decide what is that something that we want to keep an eye on.

                                +

Now let's think in terms of the given URL-shortening app. We need to monitor it, and we would want to get notified in case something goes wrong. But first, we need to decide what that something is that we want to keep an eye on.

1. Since it's a web app serving HTTP requests, we want to keep an eye on HTTP status codes and latencies
2. Request volume again is a good candidate; if the app is receiving an unusual amount of traffic, something might be off.
                                3. -
                                4. We also want to keep an eye on the database so depending on the database solution chosen. Query times, volumes, disk usage etc.
                                5. +
6. We also want to keep an eye on the database, with the exact metrics depending on the database solution chosen: query times, volumes, disk usage, etc.
7. Finally, there also needs to be some external monitoring which runs periodic tests from devices outside of your data centers. This emulates customers and ensures that, from the customer's point of view, the system is working as expected (see the sketch below).
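
As a hedged sketch of the external monitoring item above (the URL and the alert condition are hypothetical), a probe can be as simple as timing a request and checking the status code:

import time
import urllib.request

# emulate a customer: fetch the page, record the status code and the latency
start = time.time()
with urllib.request.urlopen("https://example.com") as resp:
    status = resp.status
latency = time.time() - start
print(status, round(latency, 3))  # alert if status != 200 or latency is too high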

                                Applications in SRE role

                                -

                                In the world of SRE, python is a widely used language. For small scripts and tooling developed for various purposes. Since tooling developed by SRE works with critical pieces of infrastructure and has great power (to bring things down), it is important to know what you are doing while using a programming language and its features. Also it is equally important to know the language and its characteristics while debugging the issues. As an SRE having a deeper understanding of python language, it has helped me a lot to debug very sneaky bugs and be generally more aware and informed while making certain design decisions.

                                +

In the world of SRE, Python is a widely used language for small scripts and tooling developed for various purposes. Since tooling developed by SREs works with critical pieces of infrastructure and has great power (to bring things down), it is important to know what you are doing while using a programming language and its features. It is equally important to know the language and its characteristics while debugging issues. As an SRE, having a deeper understanding of the Python language has helped me a lot in debugging very sneaky bugs and in being generally more aware and informed while making certain design decisions.

While developing tools may or may not be part of the SRE job, supporting tools or services is more likely to be a daily duty. Building an application or tool is just a small part of productionization. While a certain amount of effort certainly goes into the design of the application itself to make it more robust, as an SRE you are responsible for its reliability and stability once it is deployed and running. And to ensure that, you'd need to understand the application first, then come up with a strategy to monitor it properly and be prepared for various failure scenarios.

                                Optional Exercises

                                1. Make a decorator that will cache function return values depending on input parameters.
                                2. -
                                3. Host the URL shortening app on any cloud provider.
                                4. -
                                5. Setup monitoring using many of the tools available like catchpoint, datadog etc.
                                6. -
                                7. Create a minimal flask-like framework on top of TCP sockets.
                                8. +
                                9. Host the URL-shortening app on any cloud provider.
                                10. +
11. Set up monitoring using any of the tools available, like Catchpoint, Datadog, etc.
                                12. +
                                13. Create a minimal Flask-like framework on top of TCP sockets.

                                Conclusion

                                -

                                This module, in the first part, aims to make you more aware of the things that will happen when you choose python as your programming language and what happens when you run a python program. With the knowledge of how python handles things internally as objects, lot of seemingly magic things in python will start to make more sense.

                                -

                                The second part will first explain how a framework like flask works using the existing knowledge of protocols like TCP and HTTP. It then touches the whole lifecycle of an application development lifecycle including the SRE parts of it. While the design and areas in architecture considered will not be exhaustive, it will give a good overview of things that are also important being an SRE and why they are important.

                                +

The first part of this module aims to make you more aware of what happens when you choose Python as your programming language and what happens when you run a Python program. With the knowledge of how Python handles things internally as objects, a lot of seemingly magical things in Python will start to make more sense.

                                +

The second part first explains how a framework like Flask works, using the existing knowledge of protocols like TCP and HTTP. It then touches the whole application development lifecycle, including the SRE parts of it. While the design and the architectural areas considered will not be exhaustive, it will give a good overview of things that are also important to being an SRE and why they are important.

                                diff --git a/level101/python_web/url-shorten-app/index.html b/level101/python_web/url-shorten-app/index.html index b7c230b..16b0020 100644 --- a/level101/python_web/url-shorten-app/index.html +++ b/level101/python_web/url-shorten-app/index.html @@ -2231,17 +2231,17 @@

                                The URL Shortening App

                                -

                                Let's build a very simple URL shortening app using flask and try to incorporate all aspects of the development process including the reliability aspects. We will not be building the UI and we will come up with a minimal set of API that will be enough for the app to function well.

                                +

Let's build a very simple URL-shortening app using Flask and try to incorporate all aspects of the development process, including the reliability aspects. We will not be building the UI, and we will come up with a minimal set of APIs that will be enough for the app to function well.

                                Design

We don't jump directly to coding. The first thing we do is gather requirements, come up with an approach, have the approach/design reviewed by peers, evolve, iterate, document the decisions and tradeoffs, and then finally implement. While we will not write the full-blown design document here, we will raise certain questions that are important to the design.

                                1. High Level Operations and API Endpoints

                                -

                                Since it's a URL shortening app, we will need an API for generating the shorten link given an original link. And an API/Endpoint which will accept the shorten link and redirect to original URL. We are not including the user aspect of the app to keep things minimal. These two API should make app functional and usable by anyone.

                                +

Since it's a URL-shortening app, we will need an API for generating the shortened link given an original link, and an API/endpoint which will accept the shortened link and redirect to the original URL. We are not including the user aspect of the app to keep things minimal. These two APIs should make the app functional and usable by anyone.

                                2. How to shorten?

                                -

                                Given a url, we will need to generate a shortened version of it. One approach could be using random characters for each link. Another thing that can be done is to use some sort of hashing algorithm. The benefit here is we will reuse the same hash for the same link. ie: if lot of people are shortening https://www.linkedin.com they all will have the same value, compared to multiple entries in DB if chosen random characters.

                                -

                                What about hash collisions? Even in random characters approach, though there is a less probability, hash collisions can happen. And we need to be mindful of them. In that case we might want to prepend/append the string with some random value to avoid conflict.

                                +

Given a URL, we will need to generate a shortened version of it. One approach could be using random characters for each link. Another thing that can be done is to use some sort of hashing algorithm. The benefit here is we will reuse the same hash for the same link, ie: if a lot of people are shortening https://www.linkedin.com, they will all get the same value, compared to multiple DB entries if we chose random characters.

                                +

What about hash collisions? Even in the random characters approach, though the probability is lower, hash collisions can happen, and we need to be mindful of them. In that case, we might want to prepend/append the string with some random value to avoid the conflict.

Also, the choice of hash algorithm matters. We will need to analyze algorithms, their CPU requirements and their characteristics, and choose the one that suits best.
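
A hedged sketch of the hashing approach (md5 and the 7-character truncation are illustrative choices, not recommendations):

from hashlib import md5

def shorten(url: str, length: int = 7) -> str:
    # the same URL always yields the same short code; truncating the digest
    # raises the collision probability, which must be handled separately
    return md5(url.encode()).hexdigest()[:length]

print(shorten("https://www.linkedin.com"))  # deterministic short code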

                                3. Is URL Valid?

                                -

                                Given a URL to shorten, how do we verify if the URL is valid? Do we even verify or validate? One basic check that can be done is see if the URL matches a regex of a URL. To go even further we can try opening/visiting the URL. But there are certain gotchas here.

                                +

Given a URL to shorten, how do we verify that the URL is valid? Do we even verify or validate? One basic check that can be done is to see if the URL matches a URL regex (a sketch follows the list below). To go even further, we can try opening/visiting the URL. But there are certain gotchas here.

                                1. We need to define success criteria. ie: HTTP 200 means it is valid.
2. What if the URL is in a private network?
                                3. @@ -2250,10 +2250,9 @@
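
A hedged sketch of the basic regex check mentioned above (the pattern is deliberately naive and for illustration only):

import re

# naive pattern: scheme, host and an optional path; real validation needs more
URL_RE = re.compile(r"^https?://[\w.-]+(/\S*)?$")

def looks_like_url(candidate: str) -> bool:
    return URL_RE.match(candidate) is not None

print(looks_like_url("https://www.linkedin.com"))  # True
print(looks_like_url("not a url"))                 # False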

                                  4. Storage

Finally, storage. Where will we store the data that we will generate over time? There are multiple database solutions available, and we will need to choose the one that suits this app the most. A relational database like MySQL would be a fair choice, but be sure to check out School of SRE's SQL database section and NoSQL databases section for deeper insights into making a more informed decision.

                                  5. Other

                                  -

                                  We are not accounting for users into our app and other possible features like rate limiting, customized links etc but it will eventually come up with time. Depending on the requirements, they too might need to get incorporated.

                                  -

                                  The minimal working code is given below for reference but I'd encourage you to come up with your own.

                                  +

We are not accounting for users in our app, nor other possible features like rate limiting, customized links, etc., but these will eventually come up with time. Depending on the requirements, they too might need to be incorporated.

                                  +

                                  The minimal working code is given below for reference, but I'd encourage you to come up with your own.

                                  from flask import Flask, redirect, request
                                  -
                                   from hashlib import md5
                                   
                                   app = Flask("url_shortener")
                                  diff --git a/level101/security/conclusion/index.html b/level101/security/conclusion/index.html
                                  index 26e5690..4fe6620 100644
                                  --- a/level101/security/conclusion/index.html
                                  +++ b/level101/security/conclusion/index.html
                                  @@ -2168,21 +2168,33 @@
                                   

                                  Other Resources

                                  Some books that would be a great resource

Post-Training Asks / Further Reading

                                  diff --git a/level101/security/fundamentals/index.html b/level101/security/fundamentals/index.html index ffc6140..22705db 100644 --- a/level101/security/fundamentals/index.html +++ b/level101/security/fundamentals/index.html @@ -2352,33 +2352,35 @@
4. They have quite a big role in system design and hence are often the first line of defence.
5. SREs help in preventing bad designs and implementations which can affect the overall security of the infrastructure.
                                6. Successfully designing, implementing, and maintaining systems requires a commitment to the full system lifecycle. This commitment is possible only when security and reliability are central elements in the architecture of systems.
-
Core Pillars of Information Security :
-
Confidentiality – only allow access to data for which the user is permitted
-
Integrity – ensure data is not tampered or altered by unauthorized users
-
Availability – ensure systems and data are available to authorized users when they need it
-
Thinking like a Security Engineer
-
When starting a new application or re-factoring an existing application, you should consider each functional feature, and consider:

+
Core Pillars of Information Security:

• Confidentiality—only allow access to data for which the user is permitted
• Integrity—ensure data is not tampered or altered by unauthorized users
• Availability—ensure systems and data are available to authorized users when they need it

Thinking like a Security Engineer:

• When starting a new application or re-factoring an existing application, you should consider each functional feature, and consider:
  • Is the process surrounding this feature as safe as possible? In other words, is this a flawed process?
  • If I were evil, how would I abuse this feature? Or, more specifically, failing to address how a feature can be abused can cause design flaws.
  • Is the feature required to be on by default? If so, are there limits or options that could help reduce the risk from this feature?
Security Principles By OWASP (Open Web Application Security Project)
-
- Minimize attack surface area :
                                    +
                                  • Minimize attack surface area:
                                    • Every feature that is added to an application adds a certain amount of risk to the overall application. The aim of secure development is to reduce the overall risk by reducing the attack surface area.
                                    • For example, a web application implements online help with a search function. The search function may be vulnerable to SQL injection attacks. If the help feature was limited to authorized users, the attack likelihood is reduced. If the help feature’s search function was gated through centralized data validation routines, the ability to perform SQL injection is dramatically reduced. However, if the help feature was re-written to eliminate the search function (through a better user interface, for example), this almost eliminates the attack surface area, even if the help feature was available to the Internet at large.
                                  • Establish secure defaults:
                                      -
                                    • There are many ways to deliver an “out of the box” experience for users. However, by default, the experience should be secure, and it should be up to the user to reduce their security – if they are allowed.
                                    • +
                                    • There are many ways to deliver an “out of the box” experience for users. However, by default, the experience should be secure, and it should be up to the user to reduce their security—if they are allowed.
                                    • For example, by default, password ageing and complexity should be enabled. Users might be allowed to turn these two features off to simplify their use of the application and increase their risk.
• Default passwords of routers and IoT devices should be changed
                                    @@ -2397,19 +2399,19 @@
                                  • Fail securely

                                      -
                                    • Applications regularly fail to process transactions for many reasons. How they fail can determine if an application is secure or not.
                                    • +
• Applications regularly fail to process transactions for many reasons. How they fail can determine if an application is secure or not.
                                      
                                      +  is_admin = true;
                                      +  try {
                                      +    code_which_may_fail();
+    is_admin = is_user_assigned_role("Administrator");
                                      +  }
                                      +  catch (Exception err) {
                                      +    log.error(err.toString());
                                      +  }
                                      +  
                                    • +
• If either code_which_may_fail() or is_user_assigned_role() fails or throws an exception, the user is an admin by default. This is obviously a security risk.
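A fail-secure version of the same logic, as a minimal Python sketch (the two helper functions are stand-ins mirroring the pseudocode above), defaults to the least-privileged state so an exception cannot grant access:

```
import logging

log = logging.getLogger(__name__)

def code_which_may_fail():
    raise RuntimeError("transient failure")  # stand-in for the risky call

def is_user_assigned_role(role: str) -> bool:
    return False  # stand-in for a real role lookup

is_admin = False  # deny by default; failure leaves us in the safe state
try:
    code_which_may_fail()
    is_admin = is_user_assigned_role("Administrator")
except Exception as err:
    log.error(str(err))  # is_admin remains False
```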

                                  • Don’t trust services

                                    @@ -2422,7 +2424,7 @@ log.error(err.toString());
                                  • Separation of duties
                                    • The key to fraud control is the separation of duties. For example, someone who requests a computer cannot also sign for it, nor should they directly receive the computer. This prevents the user from requesting many computers and claiming they never arrived.
                                    • Certain roles have different levels of trust than normal users. In particular, administrators are different from normal users. In general, administrators should not be users of the application.
                                    • -
                                    • For example, an administrator should be able to turn the system on or off, set password policy but shouldn’t be able to log on to the storefront as a super privileged user, such as being able to “buy” goods on behalf of other users.
                                    • +
                                    • For example, an administrator should be able to turn the system on or off, set password policy but shouldn't be able to log on to the storefront as a super privileged user, such as being able to "buy" goods on behalf of other users.
                                  • Avoid security by obscurity
                                      @@ -2444,7 +2446,7 @@ log.error(err.toString());
                                    • Reliability & Security
• Reliability and security are both crucial components of a truly trustworthy system, but building systems that are both reliable and secure is difficult. While the requirements for reliability and security share many common properties, they also require different design considerations. It is easy to miss the subtle interplay between reliability and security that can cause unexpected outcomes.
                                      • -
                                      • Ex: A password management application failure was triggered by a reliability problem i.e poor load-balancing and load-shedding strategies and its recovery were later complicated by multiple measures (HSM mechanism which needs to be plugged into server racks, which works as an authentication & the HSM token supposedly locked inside a case.. & the problem can be further elongated ) designed to increase the security of the system.
                                      • +
• Ex: A password management application failure was triggered by a reliability problem, i.e. poor load-balancing and load-shedding strategies, and its recovery was later complicated by multiple measures designed to increase the security of the system (an HSM mechanism that needs to be plugged into server racks and works as an authentication factor, with the HSM token supposedly locked inside a case, which can prolong the problem further).
                                    @@ -2465,7 +2467,7 @@ log.error(err.toString());

                                  OpenID/OAuth

                                  OpenID is an authentication protocol that allows us to authenticate users without using a local auth system. In such a scenario, a user has to be registered with an OpenID Provider and the same provider should be integrated with the authentication flow of your application. To verify the details, we have to forward the authentication requests to the provider. On successful authentication, we receive a success message and/or profile details with which we can execute the necessary flow.

                                  -

                                  OAuth is an authorization mechanism that allows your application user access to a provider(Gmail/Facebook/Instagram/etc). On successful response, we (your application) receive a token with which the application can access certain APIs on behalf of a user. OAuth is convenient in case your business use case requires some certain user-facing APIs like access to Google Drive or sending tweets on your behalf. Most OAuth 2.0 providers can be used for pseudo authentication. Having said that, it can get pretty complicated if you are using multiple OAuth providers to authenticate users on top of the local authentication system.

                                  +

OAuth is an authorization mechanism that allows your application to access a provider (Gmail/Facebook/Instagram, etc.) on behalf of a user. On a successful response, we (your application) receive a token with which the application can access certain APIs on behalf of the user. OAuth is convenient in case your business use case requires access to certain user-facing APIs, like access to Google Drive or sending tweets on your behalf. Most OAuth 2.0 providers can be used for pseudo-authentication. Having said that, it can get pretty complicated if you are using multiple OAuth providers to authenticate users on top of the local authentication system.


                                  Cryptography

                                    @@ -2493,14 +2495,14 @@ D(k,E(k,m)) = m
Stream Ciphers:

                                    -
                                  • The message is broken into characters or bits and enciphered with a key or keystream(should be random and generated independently of the message stream) that is as long as the plaintext bitstream.
                                  • +
                                  • The message is broken into characters or bits and enciphered with a key or keystream (should be random and generated independently of the message stream) that is as long as the plaintext bitstream.
                                  • If the keystream is random, this scheme would be unbreakable unless the keystream was acquired, making it unconditionally secure. The keystream must be provided to both parties in a secure way to prevent its release.
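A toy keystream XOR in Python makes the idea concrete (a real stream cipher derives the keystream from a key, e.g. ChaCha20; here the keystream is just random bytes as long as the message):

```
import secrets

def xor_stream(message: bytes, keystream: bytes) -> bytes:
    # Encryption and decryption are the same operation: XOR with the keystream.
    return bytes(m ^ k for m, k in zip(message, keystream))

msg = b"attack at dawn"
keystream = secrets.token_bytes(len(msg))  # as long as the plaintext
ciphertext = xor_stream(msg, keystream)
assert xor_stream(ciphertext, keystream) == msg
```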

                                  Block Ciphers:

                                    -
                                  • Block ciphers — process messages in blocks, each of which is then encrypted or decrypted.
                                  • +
                                  • Block ciphers—process messages in blocks, each of which is then encrypted or decrypted.
                                  • -

                                    A block cipher is a symmetric cipher in which blocks of plaintext are treated as a whole and used to produce ciphertext blocks. The block cipher takes blocks that are b bits long and encrypts them to blocks that are also b bits long. Block sizes are typically 64 or 128 bits long.

                                    +

                                    A block cipher is a symmetric cipher in which blocks of plaintext are treated as a whole and used to produce ciphertext blocks. The block cipher takes blocks that are b bits long and encrypts them to blocks that are also b bits long. Block sizes are typically 64 or 128 bits long.

                                    image5 image6
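To see a block cipher operate on one b-bit block, here is a sketch using AES-128 via the third-party pyca/cryptography package (an assumption; ECB mode is used only because it exposes the raw block operation and should not be used in production):

```
import os

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(16)         # 128-bit AES key
block = b"16-byte block!!!"  # exactly one 128-bit block

encryptor = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
ciphertext = encryptor.update(block) + encryptor.finalize()  # 16 bytes in, 16 bytes out

decryptor = Cipher(algorithms.AES(key), modes.ECB()).decryptor()
assert decryptor.update(ciphertext) + decryptor.finalize() == block
```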

                                  • @@ -2508,7 +2510,7 @@ D(k,E(k,m)) = m

                                    Encryption

                                    • Secret Key (Symmetric Key): the same key is used for encryption and decryption
                                    • -
                                    • Public Key (Asymmetric Key) in an asymmetric, the encryption and decryption keys are different but related. The encryption key is known as the public key and the decryption key is known as the private key. The public and private keys are known as a key pair.
                                    • +
• Public Key (Asymmetric Key): in an asymmetric scheme, the encryption and decryption keys are different but related. The encryption key is known as the public key and the decryption key is known as the private key. The public and private keys are known as a key pair.

                                    Symmetric Key Encryption

                                    DES

                                    @@ -2556,7 +2558,7 @@ D(k,E(k,m)) = m

                                    NOTE: In terms of TLS key exchange, this is the common approach.

                                    Diffie-Hellman

                                      -
• The protocol has two system parameters, p and g. They are both public and may be used by everybody. Parameter p is a prime number, and parameter g (usually called a generator) is an integer that is smaller than p, but with the following property: For every number n between 1 and p – 1 inclusive, there is a power k of g such that n = g^k mod p.
                                    • +
• The protocol has two system parameters, p and g. They are both public and may be used by everybody. Parameter p is a prime number, and parameter g (usually called a generator) is an integer that is smaller than p, but with the following property: For every number n between 1 and p – 1 inclusive, there is a power k of g such that n = g^k mod p.
• The Diffie-Hellman algorithm is an asymmetric algorithm used to establish a shared secret for a symmetric key algorithm. Nowadays most people use a hybrid cryptosystem, i.e., a combination of symmetric and asymmetric encryption. Asymmetric encryption is used as a technique in the key exchange mechanism to share a secret key, and after the key is shared between sender and receiver, the communication takes place using symmetric encryption. The shared secret key will be used to encrypt the communication.
                                    • Refer: https://medium.com/@akhigbemmanuel/what-is-the-diffie-hellman-key-exchange-algorithm-84d60025a30d
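A toy run of the exchange in Python, with deliberately tiny parameters (real deployments use large primes, e.g. 2048-bit groups):

```
p, g = 23, 5          # public parameters (toy-sized)
a, b = 6, 15          # private keys of the two parties

A = pow(g, a, p)      # Alice sends A over the wire
B = pow(g, b, p)      # Bob sends B over the wire

# Each side combines its own private key with the other's public value.
assert pow(B, a, p) == pow(A, b, p)  # the shared secret, never transmitted
```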
                                    @@ -2578,10 +2580,12 @@ D(k,E(k,m)) = m
                                  • More:


                                  MD5

• MD5 is a one-way function with which it is easy to compute the hash from the given input data, but it is infeasible to compute the input data given only a hash.
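For example, with Python's hashlib (the digest of b"hello" is a well-known value):

```
from hashlib import md5

digest = md5(b"hello").hexdigest()  # easy direction: input -> hash
print(digest)                       # 5d41402abc4b2a76b9719d911017c592
# The reverse direction, recovering b"hello" from the digest alone, is
# infeasible by design (though MD5 is broken for collision resistance).
```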
                                  • @@ -2651,8 +2655,10 @@ D(k,E(k,m)) = m
• a client: a user / a service
• a server: where Kerberos-protected hosts reside

-
image10 - - a Key Distribution Center (KDC), which acts as the trusted third-party authentication service.
+
image10
+

                                    a Key Distribution Center (KDC), which acts as the trusted third-party authentication service.

                                  The KDC includes the following two servers:

                                  @@ -2664,23 +2670,24 @@ D(k,E(k,m)) = m

                          Certificate Chain

                          -

                          The first part of the output of the OpenSSL command shows three certificates numbered 0, 1, and 2(not 2 anymore). Each certificate has a subject, s, and an issuer, i. The first certificate, number 0, is called the end-entity certificate. The subject line tells us it’s valid for any subdomain of google.com because its subject is set to *.google.com.

                          -

$ openssl s_client -connect www.google.com:443 -CApath /etc/ssl/certs
+

The first part of the output of the OpenSSL command shows the certificates in the chain, numbered 0 and 1 (older outputs included a third certificate, number 2). Each certificate has a subject, s, and an issuer, i. The first certificate, number 0, is called the end-entity certificate. The subject line tells us it’s valid for any subdomain of google.com because its subject is set to *.google.com.

                          +
                          $ openssl s_client -connect www.google.com:443 -CApath /etc/ssl/certs
                           CONNECTED(00000005)
                           depth=2 OU = GlobalSign Root CA - R2, O = GlobalSign, CN = GlobalSign
                           verify return:1
                           depth=1 C = US, O = Google Trust Services, CN = GTS CA 1O1
                           verify return:1
                           depth=0 C = US, ST = California, L = Mountain View, O = Google LLC, CN = www.google.com
verify return:1
---
Certificate chain
 0 s:/C=US/ST=California/L=Mountain View/O=Google LLC/CN=www.google.com
   i:/C=US/O=Google Trust Services/CN=GTS CA 1O1
 1 s:/C=US/O=Google Trust Services/CN=GTS CA 1O1
   i:/OU=GlobalSign Root CA - R2/O=GlobalSign/CN=GlobalSign
---

Server certificate

• The issuer line indicates it’s issued by GTS CA 1O1, which also happens to be the subject of the second certificate, number 1.
                          • What the OpenSSL command line doesn’t show here is the trust store that contains the list of CA certificates trusted by the system OpenSSL runs on.
                          • @@ -2696,14 +2703,14 @@ Certificate chain
1. The client sends a HELLO message to the server with a list of protocols and algorithms it supports.
2. The server says HELLO back and sends its chain of certificates. Based on the capabilities of the client, the server picks a cipher suite.
-
3. If the cipher suite supports ephemeral key exchange, like ECDHE does(ECDHE is an algorithm known as the Elliptic Curve Diffie-Hellman Exchange), the server and the client negotiate a pre-master key with the Diffie-Hellman algorithm. The pre-master key is never sent over the wire.
+
3. If the cipher suite supports ephemeral key exchange, like ECDHE does (ECDHE stands for Elliptic Curve Diffie-Hellman Exchange), the server and the client negotiate a pre-master key with the Diffie-Hellman algorithm. The pre-master key is never sent over the wire.
4. The client and server create a session key that will be used to encrypt the data transiting through the connection.
                            -

                            At the end of the handshake, both parties possess a secret session key used to encrypt data for the rest of the connection. This is what OpenSSL refers to as Master-Key

                            +

                            At the end of the handshake, both parties possess a secret session key used to encrypt data for the rest of the connection. This is what OpenSSL refers to as Master-Key.
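You can observe the result of this negotiation from Python's standard ssl module; a small sketch:

```
import socket
import ssl

ctx = ssl.create_default_context()  # verifies the chain against the system trust store
with socket.create_connection(("www.google.com", 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname="www.google.com") as tls:
        print(tls.version())  # negotiated protocol, e.g. TLSv1.3
        print(tls.cipher())   # negotiated cipher suite for the session key
```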

                            NOTE

                              -
                            • There are 3 versions of TLS , TLS 1.0, 1.1 & 1.2
                            • -
                            • TLS 1.0 was released in 1999, making it a nearly two-decade-old protocol. It has been known to be vulnerable to attacks—such as BEAST and POODLE—for years, in addition to supporting weak cryptography, which doesn’t keep modern-day connections sufficiently secure.
                            • +
                            • There are 3 versions of TLS, TLS 1.0, 1.1 & 1.2
                            • +
                            • TLS 1.0 was released in 1999, making it a nearly two-decade-old protocol. It has been known to be vulnerable to attacks—such as BEAST and POODLE—for years, in addition to supporting weak cryptography, which doesn’t keep modern-day connections sufficiently secure.
• TLS 1.1 is the forgotten “middle child.” It also has bad cryptography, like its older sibling. In most software, it was leapfrogged by TLS 1.2 and it’s rare to see TLS 1.1 used.

                            “Perfect” Forward Secrecy

                            @@ -2712,9 +2719,11 @@ Certificate chain
                          • In a non-ephemeral key exchange, the client sends the pre-master key to the server by encrypting it with the server’s public key. The server then decrypts the pre-master key with its private key. If at a later point in time, the private key of the server is compromised, an attacker can go back to this handshake, decrypt the pre-master key, obtain the session key, and decrypt the entire traffic. Non-ephemeral key exchanges are vulnerable to attacks that may happen in the future on recorded traffic. And because people seldom change their password, decrypting data from the past may still be valuable for an attacker.
                          • An ephemeral key exchange like DHE, or its variant on elliptic curve, ECDHE, solves this problem by not transmitting the pre-master key over the wire. Instead, the pre-master key is computed by both the client and the server in isolation, using nonsensitive information exchanged publicly. Because the pre-master key can’t be decrypted later by an attacker, the session key is safe from future attacks: hence, the term perfect forward secrecy.
• Keys are changed every X blocks along the stream. That prevents an attacker from simply sniffing the stream and applying brute force to crack the whole thing. “Forward secrecy” means that just because I can decrypt block M does not mean that I can decrypt block Q.
                          • -
                          • Downside:
                          • +
                          • Downside:
                            • The downside to PFS is that all those extra computational steps induce latency on the handshake and slow the user down. To avoid repeating this expensive work at every connection, both sides cache the session key for future use via a technique called session resumption. This is what the session-ID and TLS ticket are for: they allow a client and server that share a session ID to skip over the negotiation of a session key, because they already agreed on one previously, and go directly to exchanging data securely.
                          diff --git a/level101/security/intro/index.html b/level101/security/intro/index.html index 8766725..48454e4 100644 --- a/level101/security/intro/index.html +++ b/level101/security/intro/index.html @@ -2201,7 +2201,7 @@

    What to expect from this course

    -

    The course covers fundamentals of information security along with touching on subjects of system security, network & web security. This course aims to get you familiar with the basics of information security in day to day operations & then as an SRE develop the mindset of ensuring that security takes a front-seat while developing solutions. The course also serves as an introduction to common risks and best practices along with practical ways to find out vulnerable systems and loopholes which might become compromised if not secured.

    +

The course covers the fundamentals of information security, touching on subjects of system security, network & web security. This course aims to get you familiar with the basics of information security in day-to-day operations and then, as an SRE, to develop the mindset of ensuring that security takes a front seat while developing solutions. The course also serves as an introduction to common risks and best practices, along with practical ways to find vulnerable systems and loopholes which might become compromised if not secured.

    What is not covered under this course

The courseware is not an ethical hacking workshop or a very deep dive into the fundamentals of the problems. The course does not deal with hacking or breaking into systems, but rather with how to ensure you don’t get into those situations, and with making you aware of the different ways a system can be compromised.

    Course Contents

    diff --git a/level101/security/network_security/index.html b/level101/security/network_security/index.html index 76ed874..a0606ea 100644 --- a/level101/security/network_security/index.html +++ b/level101/security/network_security/index.html @@ -1285,10 +1285,10 @@
Network Perimeter Security

@@ -2376,38 +2376,38 @@
Availability | Downtime per year | Downtime per month | Downtime per week | Downtime per day
99% (Two Nines) | 3.65 days | 7.31 hours | 1.68 hours | 14.40 minutes
99.5% (Two and a half Nines) | 1.83 days | 3.65 hours | 50.40 minutes | 7.20 minutes
99.9% (Three Nines) | 8.77 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes
99.95% (Three and a half Nines) | 4.38 hours | 21.92 minutes | 5.04 minutes | 43.20 seconds
99.99% (Four Nines) | 52.60 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds
99.995% (Four and a half Nines) | 26.30 minutes | 2.19 minutes | 30.24 seconds | 4.32 seconds
99.999% (Five Nines) | 5.26 minutes | 26.30 seconds | 6.05 seconds | 0.86 seconds

    Refer

    HA - Availability Serial Components

    -

    A System with components is operating in the series If the failure of a part leads to the combination becoming inoperable.

    +

A system with components is operating in series if the failure of a part leads to the combination becoming inoperable.

    For example, if LB in our architecture fails, all access to app tiers will fail. LB and app tiers are connected serially.

    -

    The combined availability of the system is the product of individual components availability

    +

The combined availability of the system is the product of the individual components' availabilities:

    A = Ax x Ay x …..

    Refer

    HA - Availability Parallel Components

    -

    A System with components is operating in parallel If the failure of a part leads to the other part taking over the operations of the failed part.

    -

    If we have more than one LB and if the rest of the LBs can take over the traffic during one LB failure then LBs are operating in parallel

    +

A system with components is operating in parallel if the failure of a part leads to the other part taking over the operations of the failed part.

    +

    If we have more than one LB and if the rest of the LBs can take over the traffic during one LB failure, then LBs are operating in parallel.

    The combined availability of the system is

A = 1 - ( (1-Ax) x (1-Ay) x ….. )
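Both formulas in a quick Python sketch (component availabilities as fractions, e.g. 0.999 for three nines):

```
def serial(*components):
    # A = Ax * Ay * ... : any single failure takes the combination down.
    result = 1.0
    for a in components:
        result *= a
    return result

def parallel(*components):
    # A = 1 - (1-Ax) * (1-Ay) * ... : all parts must fail together.
    failure = 1.0
    for a in components:
        failure *= 1 - a
    return 1 - failure

print(serial(0.999, 0.999))    # ~0.998:    series is weaker than either part
print(parallel(0.999, 0.999))  # ~0.999999: parallel is stronger than either part
```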

    Refer

    HA - Core Principles

Elimination of single points of failure (SPOF): This means adding redundancy to the system so that the failure of a component does not mean failure of the entire system.

Reliable crossover: In redundant systems, the crossover point itself tends to become a single point of failure. Reliable systems must provide for reliable crossover.

    -

    Detection of failures as they occur If the two principles above are observed, then a user may never see a failure

    +

Detection of failures as they occur: If the two principles above are observed, then a user may never see a failure.

    Refer

    HA - SPOF

    WHAT: Never implement and always eliminate single points of failure.

    WHEN TO USE: During architecture reviews and new designs.

    -

    HOW TO USE: Identify single instances on architectural diagrams. Strive for active/active configurations. At the very least we should have a standby to take control when active instances fail.

    +

    HOW TO USE: Identify single instances on architectural diagrams. Strive for active/active configurations. At the very least, we should have a standby to take control when active instances fail.

    WHY: Maximize availability through multiple instances.

    KEY TAKEAWAYS: Strive for active/active rather than active/passive solutions. Use load balancers to balance traffic across instances of a service. Use control services with active/passive instances for patterns that require singletons.

    HA - Reliable Crossover

    @@ -2415,16 +2415,16 @@

    WHEN TO USE: During architecture reviews, failure modeling, and designs.

    HOW TO USE: Identify how available a system is during the crossover and ensure it is within acceptable limits.

    WHY: Maximize availability and ensure data handling semantics are preserved.

    -

    KEY TAKEAWAYS: Strive for active/active rather than active/passive solutions, they have a lesser risk of cross over being unreliable. Use LB and the right load balancing methods to ensure reliable failover. Model and build your data systems to ensure data is correctly handled when crossover happens. Generally, DB systems follow active/passive semantics for writes. Masters accept writes and when the master goes down, the follower is promoted to master(active from being passive) to accept writes. We have to be careful here that the cutover never introduces more than one master. This problem is called a split brain.

    +

KEY TAKEAWAYS: Strive for active/active rather than active/passive solutions; they have a lesser risk of the crossover being unreliable. Use LB and the right load-balancing methods to ensure reliable failover. Model and build your data systems to ensure data is correctly handled when crossover happens. Generally, DB systems follow active/passive semantics for writes. Masters accept writes, and when the master goes down, the follower is promoted to master (active from being passive) to accept writes. We have to be careful here that the cutover never introduces more than one master. This problem is called split brain.

    Applications in SRE role

1. SRE works on deciding an acceptable SLA and makes sure the system is available to achieve the SLA.
2. SRE is involved in architecture design right from building the data center to make sure the site is not affected by network switch, hardware, power, or software failures.
3. SRE also runs mock drills of failures to see how the system behaves in uncharted territory and comes up with a plan to improve availability if there are misses.
https://engineering.linkedin.com/blog/2017/11/resilience-engineering-at-linkedin-with-project-waterbear
    -

    Post our understanding about HA, our architecture diagram looks something like this below -HA Block Diagram

    +

Having built this understanding of HA, our architecture diagram looks something like the one below:

    +

    HA Block Diagram

    diff --git a/level101/systems_design/conclusion/index.html b/level101/systems_design/conclusion/index.html index bb74b69..6d9c846 100644 --- a/level101/systems_design/conclusion/index.html +++ b/level101/systems_design/conclusion/index.html @@ -2105,7 +2105,7 @@

    Conclusion

    -

    Armed with these principles, we hope the course will give a fresh perspective to design software systems. It might be over-engineering to get all this on day zero. But some are really important from day 0 like eliminating single points of failure, making scalable services by just increasing replicas. As a bottleneck is reached, we can split code by services, shard data to scale. As the organization matures, bringing in chaos engineering to measure how systems react to failure will help in designing robust software systems.

    +

Armed with these principles, we hope the course will give you a fresh perspective on designing software systems. It might be over-engineering to get all of this in place on day zero, but some things are really important from day 0, like eliminating single points of failure and making services scalable by just increasing replicas. As a bottleneck is reached, we can split code by services and shard data to scale. As the organization matures, bringing in chaos engineering to measure how systems react to failure will help in designing robust software systems.

    diff --git a/level101/systems_design/fault-tolerance/index.html b/level101/systems_design/fault-tolerance/index.html index 00bdf79..c53485a 100644 --- a/level101/systems_design/fault-tolerance/index.html +++ b/level101/systems_design/fault-tolerance/index.html @@ -1063,10 +1063,10 @@
Fault Tolerance: Failure Metrics