
Chapter 2. YARN Architecture
This chapter dives deep into the YARN architecture, its core components, and how they interact to deliver better resource utilization, performance, and manageability. It also covers some important YARN terminology.
In this chapter, we will cover the following topics:
- Core components of YARN architecture
- Interaction and flow of YARN components
- ResourceManager scheduling policies
- Recent developments in YARN
The motivation behind the YARN architecture is to support data processing models beyond MapReduce, such as Apache Spark, Apache Storm, Apache Giraph, Apache HAMA, and so on. YARN provides a platform to develop and execute distributed processing applications. It also improves efficiency and resource-sharing capabilities.
The design decision behind the YARN architecture is to split the two major responsibilities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons: a cluster-level ResourceManager (RM) and an application-specific ApplicationMaster (AM). YARN follows a master-slave model in which the ResourceManager is the master and the node-specific NodeManager (NM) is the slave. Together, the global ResourceManager and the per-node NodeManagers form a generic, scalable, and simple platform for managing distributed applications. The ResourceManager is the supervising component that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is an application-specific daemon that negotiates resources from the ResourceManager and works with the NodeManagers to execute and monitor the application's tasks.
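To make the master-slave relationship concrete, here is a minimal sketch (not taken from the original text) that uses YARN's Java client API to ask the ResourceManager, the master, for a report of every running NodeManager, the slaves, that it is tracking. The class name is an illustrative placeholder, and the cluster addresses are assumed to come from the yarn-site.xml on the classpath:

import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListNodeManagers {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration()); // loads yarn-site.xml from the classpath
    yarnClient.start();

    // The ResourceManager knows every registered NodeManager and its capacity.
    List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
    for (NodeReport node : nodes) {
      System.out.println(node.getNodeId()
          + " capability=" + node.getCapability()
          + " used=" + node.getUsed()
          + " containers=" + node.getNumContainers());
    }
    yarnClient.stop();
  }
}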
The following diagram shows how the JobTracker's responsibilities are split between the global ResourceManager and per-application ApplicationMasters, and how the per-node TaskTracker is replaced by the NodeManager. The JobTracker and TaskTracker support only MapReduce applications, with limited scalability and poor cluster utilization; YARN supports multiple distributed data processing models with improved scalability and cluster utilization.

The ResourceManager has a cluster-level scheduler that allocates resources to all running applications according to their ApplicationMasters' requests. The primary responsibility of the ResourceManager is to allocate resources to applications. The ResourceManager does not track the status of an application or monitor its tasks, nor does it guarantee that tasks are restarted or rebalanced in the case of application or hardware failure.
The per-application ApplicationMaster is responsible for negotiating resources, such as memory, CPU, and disk, from the ResourceManager once the application is submitted. It is also responsible for tracking the application's status and monitoring its processes in coordination with the NodeManagers.
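As a deliberately simplified illustration of that negotiation, the following sketch uses YARN's AMRMClient Java API. The class name, container size, priority, and polling loop are assumptions made for the example; a real ApplicationMaster would also use NMClient to launch and monitor tasks on the granted containers:

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SimpleApplicationMaster {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new YarnConfiguration());
    rmClient.start();

    // Register this ApplicationMaster with the ResourceManager.
    rmClient.registerApplicationMaster("", 0, "");

    // Ask the ResourceManager's scheduler for one container: 2 GB RAM, 1 vCore.
    Resource capability = Resource.newInstance(2048, 1);
    ContainerRequest ask =
        new ContainerRequest(capability, null, null, Priority.newInstance(0));
    rmClient.addContainerRequest(ask);

    // Heartbeat the ResourceManager until the container is granted; the AM would
    // then contact the NodeManager to launch and monitor the task in it.
    boolean allocated = false;
    while (!allocated) {
      AllocateResponse response = rmClient.allocate(0.0f);
      for (Container container : response.getAllocatedContainers()) {
        System.out.println("Granted container " + container.getId()
            + " on node " + container.getNodeId());
        allocated = true;
      }
      Thread.sleep(1000);
    }

    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    rmClient.stop();
  }
}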
Let's have a look at the high-level architecture of Hadoop 2.0, shown in the following image. As you can see, YARN supports many more applications than just MapReduce. The key component of Hadoop 2 is YARN, which provides better cluster resource management, while the underlying file system remains the Hadoop Distributed File System (HDFS):

Here are some key concepts that we should know before exploring the YARN architecture in detail:
- Application: This is the job submitted to the framework, for example, a MapReduce job. It could also be a shell script.
- Container: This is the basic unit of resource allocation, for example, a container with 4 GB of RAM and one CPU. Containers allow fine-grained resource allocation and replace the fixed map and reduce slots of previous versions of Hadoop (see the sketch after this list).
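To see how these two terms show up in code, here is a hedged sketch that submits a trivial shell command as an application and describes the 4 GB, one-vCore container that should run its ApplicationMaster, using YARN's YarnClient API. The class name, queue, command, and resource sizes are all assumptions made for illustration:

import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitShellApplication {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // The "application": a job submitted to the framework, here a shell command.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
    context.setApplicationName("shell-demo");

    Map<String, LocalResource> localResources = Collections.emptyMap();
    Map<String, String> environment = Collections.emptyMap();
    List<String> commands = Collections.singletonList("echo hello");
    ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
        localResources, environment, commands, null, null, null);
    context.setAMContainerSpec(amContainer);

    // The "container" asked for: 4 GB of RAM and one virtual core.
    context.setResource(Resource.newInstance(4096, 1));
    context.setQueue("default");

    yarnClient.submitApplication(context);
    System.out.println("Submitted application " + context.getApplicationId());
  }
}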