Deployed 3bfbf06 with MkDocs version: 1.1.2

Kalyanasundaram Somasundaram
2020-11-25 13:38:11 +05:30
parent e20d24fb7f
commit df23fd3354
54 changed files with 596 additions and 421 deletions


@@ -14,7 +14,7 @@
<title>Evolution and Architecure of Hadoop - SchoolOfSRE</title>
<title>Evolution and Architecture of Hadoop - SchoolOfSRE</title>
@@ -36,6 +36,8 @@
<link rel="stylesheet" href="../../stylesheets/custom.css">
@@ -85,7 +87,7 @@
</span>
<span class="md-header-nav__topic md-ellipsis">
Evolution and Architecure of Hadoop
Evolution and Architecture of Hadoop
</span>
</div>
@@ -552,13 +554,13 @@
<input class="md-nav__toggle md-toggle" data-md-toggle="nav-4-1" type="checkbox" id="nav-4-1" >
<label class="md-nav__link" for="nav-4-1">
NoSQL Concepts
NoSQL
<span class="md-nav__icon md-icon"></span>
</label>
<nav class="md-nav" aria-label="NoSQL Concepts" data-md-level="2">
<nav class="md-nav" aria-label="NoSQL" data-md-level="2">
<label class="md-nav__title" for="nav-4-1">
<span class="md-nav__icon md-icon"></span>
NoSQL Concepts
NoSQL
</label>
<ul class="md-nav__list" data-md-scrollfix>
@@ -653,7 +655,7 @@
<a href="./" class="md-nav__link md-nav__link--active">
Evolution and Architecure of Hadoop
Evolution and Architecture of Hadoop
</a>
</li>
@@ -910,14 +912,14 @@
<li>
<p><strong>HDFS</strong></p>
<ol>
<li>The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. </li>
<li>HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. </li>
<li>The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.</li>
<li>HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.</li>
<li>HDFS is part of the <a href="https://github.com/apache/hadoop">Apache Hadoop Core project</a>.</li>
</ol>
<p><img alt="HDFS Architecture" src="../images/hdfs_architecture.png" /></p>
<ol>
<li>NameNode: the arbitrator and central repository of the file namespace in the cluster. The NameNode executes operations such as opening, closing, and renaming files and directories.</li>
<li>DataNode: manages the storage attached to the node on which it runs. It is responsible for serving all the read and write requests. It performs operations on instructions on NameNode such as creation, deletion, and replications of blocks.</li>
<li>DataNode: manages the storage attached to the node on which it runs. It is responsible for serving all the read and write requests. It performs operations such as creation, deletion, and replication of blocks on instructions from the NameNode.</li>
<li>Client: responsible for getting the required metadata from the NameNode and then communicating with the DataNodes for reads and writes (see the client API sketch below). </br></br></br></li>
</ol>
</li>
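<p>To make the client's role concrete, here is a minimal, hypothetical sketch (not part of the original material; the NameNode URI and file path are assumptions) using the Hadoop <code>FileSystem</code> API, which fetches metadata from the NameNode and streams block data to and from DataNodes under the hood:</p>
<pre><code class="java">// Hypothetical HDFS client sketch: the FileSystem API talks to the NameNode
// for metadata and streams blocks to/from DataNodes on reads and writes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt"); // assumed path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs"); // blocks are replicated across DataNodes
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF()); // data is read back from DataNodes
        }
        fs.close();
    }
}
</code></pre>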
@@ -933,30 +935,30 @@
<li>Resource Manager: It is the master daemon of YARN and is responsible for resource assignment and management among all the applications. Whenever it receives a processing request, it forwards it to the corresponding node manager and allocates resources for the completion of the request accordingly. It has two major components:</li>
<li>Scheduler: It performs scheduling based on the application's resource requirements and the available resources. It is a pure scheduler, which means that it does not perform other tasks such as monitoring or tracking, and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.</li>
<li>Application Manager: It is responsible for accepting the application and negotiating the first container from the resource manager. It also restarts the Application Master container if a task fails.</li>
<li>Node Manager: It takes care of individual nodes on the Hadoop cluster and manages application and workflow and that particular node. Its primary job is to keep-up with the Node Manager. It monitors resource usage, performs log management and also kills a container based on directions from the resource manager. It is also responsible for creating the container process and starting it on the request of the Application master.</li>
<li>Application Master: An application is a single job submitted to a framework. The application manager is responsible for negotiating resources with the resource manager, tracking the status and monitoring progress of a single application. The application master requests the container from the node manager by sending a Container Launch Context(CLC) which includes everything an application needs to run. Once the application is started, it sends the health report to the resource manager from time-to-time.</li>
<li>Container: It is a collection of physical resources such as RAM, CPU cores and disk on a single node. The containers are invoked by Container Launch Context(CLC) which is a record that contains information such as environment variables, security tokens, dependencies etc. </br></br></li>
<li>Node Manager: It takes care of individual nodes on the Hadoop cluster and manages applications and workflow on that particular node. Its primary job is to keep up with the Resource Manager. It monitors resource usage, performs log management, and also kills a container based on directions from the resource manager. It is also responsible for creating the container process and starting it at the request of the Application Master.</li>
<li>Application Master: An application is a single job submitted to a framework. The application master is responsible for negotiating resources with the resource manager, tracking the status, and monitoring the progress of a single application. The application master requests the container from the node manager by sending a Container Launch Context (CLC), which includes everything an application needs to run. Once the application is started, it sends a health report to the resource manager from time to time.</li>
<li>Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a single node. Containers are invoked via a Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc. (a submission sketch follows this list). </br></br></li>
</ol>
</li>
</ol>
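<p>A rough, hypothetical sketch of the flow above (not part of the original material; the application name, command, and resource sizes are assumptions): a client registers an application with the Resource Manager and ships a Container Launch Context describing what the first container should run:</p>
<pre><code class="java">import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Ask the Resource Manager for a new application id.
        YarnClientApplication app = yarn.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("hello-yarn"); // hypothetical application name

        // CLC: command, environment, and dependencies for the first (AM) container.
        ContainerLaunchContext clc = ContainerLaunchContext.newInstance(
                null, null, Collections.singletonList("echo hello"), null, null, null);
        ctx.setAMContainerSpec(clc);
        ctx.setResource(Resource.newInstance(256, 1)); // 256 MB, 1 vcore (assumed)

        // The scheduler allocates a container; the Node Manager launches it.
        yarn.submitApplication(ctx);
        yarn.stop();
    }
}
</code></pre>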
<h1 id="mapreduce-framework">MapReduce framework</h1>
<p><img alt="MapReduce Framework" src="../images/map_reduce.jpg" /></p>
<ol>
<li>The term MapReduce represents two separate and distinct tasks Hadoop programs perform-Map Job and Reduce Job. Map jobs take data sets as input and process them to produce key value pairs. Reduce job takes the output of the Map job i.e. the key value pairs and aggregates them to produce desired results. </li>
<li>Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of large data sets on computing clusters. Mapreduce helps to split the input data set into a number of parts and run a program on all data parts parallel at once.</li>
<li>Please find the below Word count example demonstrating the usage of MapReduce framework:</li>
<li>The term MapReduce represents two separate and distinct tasks that Hadoop programs perform: the Map job and the Reduce job. Map jobs take data sets as input and process them to produce key-value pairs. The Reduce job takes the output of the Map job, i.e. the key-value pairs, and aggregates them to produce the desired result.</li>
<li>Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of large data sets on computing clusters. MapReduce splits the input data set into a number of parts and runs a program on all the parts in parallel.</li>
<li>The word count example below, shown in the figure and sketched in code after it, demonstrates the usage of the MapReduce framework:</li>
</ol>
<p><img alt="Word Count Example" src="../images/mapreduce_example.jpg" />
</br></br></p>
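<p>The same computation in code: this is essentially the classic WordCount job from the Apache Hadoop MapReduce tutorial, with the map phase emitting (word, 1) pairs and the reduce phase summing them; input and output paths come from the command line:</p>
<pre><code class="java">import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: tokenize each input line and emit (word, 1) for every token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word and emit (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
</code></pre>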
<h1 id="other-tooling-around-hadoop">Other tooling around hadoop</h1>
<h1 id="other-tooling-around-hadoop">Other tooling around Hadoop</h1>
<ol>
<li><a href="https://hive.apache.org/"><strong>Hive</strong></a><ol>
<li>Uses a language called HQL, which is very SQL-like. Gives non-programmers the ability to query and analyze data in Hadoop. It is basically an abstraction layer on top of MapReduce; a JDBC sketch follows the example queries below.</li>
<li>Ex. HQL query: <ol>
<li>Ex. HQL query:<ol>
<li><em>SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);</em></li>
</ol>
</li>
<li>In mysql: <ol>
<li>In MySQL:<ol>
<li><em>SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name;</em></li>
</ol>
</li>
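<li>For illustration only (a hypothetical sketch, not from the original material; the host, port, and credentials are assumptions), the same HQL can be run against HiveServer2 over JDBC with the <code>hive-jdbc</code> driver on the classpath:<pre><code class="java">import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint, database, and user.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name)")) {
            while (rs.next()) {
                // Print each (name, comment) row returned by the HQL query.
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}
</code></pre>
</li>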
@@ -979,7 +981,7 @@ What is the output of running the pig queries in the right column against the da
3. <a href="https://spark.apache.org/"><strong>Spark</strong></a>
1. Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms.
4. <a href="https://prestodb.io/"><strong>Presto</strong></a>
1. Presto is a high performance, distributed SQL query engine for Big Data.
1. Presto is a high-performance, distributed SQL query engine for Big Data.
2. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Kafka, and MongoDB.
3. Example Presto query:
<code>mysql