Apache YARN
HDFS is the data-storage layer for Hadoop, and MapReduce was the data-processing layer in Hadoop 1.x. The MapReduce algorithm by itself, however, isn’t sufficient for the very wide variety of use cases Hadoop is being employed to solve. Hadoop 2.0 introduces YARN as a generic resource-management and distributed application framework, on which one can implement multiple data-processing applications customized for the task at hand. The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM).
The ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic system for managing applications in a distributed manner.
The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework-specific entity tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.
The ResourceManager has a pluggable Scheduler, which is responsible for allocating resources to the various running applications subject to the familiar constraints of capacities, queues, and so on. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of application status, and it offers no guarantees about restarting tasks that fail due to either application or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a **Resource Container**, which incorporates resource elements such as memory, CPU, disk, and network.
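The "pure scheduler" contract can be illustrated with a small sketch. The names below (`Resource`, `PureScheduler`) are hypothetical stand-ins, not the actual YARN API: the point is that the scheduler only grants or refuses abstract containers against a capacity, and keeps no application state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Resource:
    """Abstract resource container: memory (MB) and virtual cores."""
    memory_mb: int
    vcores: int

class PureScheduler:
    """Illustrative 'pure' scheduler: it allocates resources subject to a
    capacity and does nothing else -- no monitoring, no restart guarantees."""
    def __init__(self, capacity: Resource):
        self.capacity = capacity
        self.allocated = Resource(0, 0)

    def allocate(self, request: Resource) -> bool:
        """Grant the request if it fits within the remaining capacity."""
        new_mem = self.allocated.memory_mb + request.memory_mb
        new_cpu = self.allocated.vcores + request.vcores
        if new_mem <= self.capacity.memory_mb and new_cpu <= self.capacity.vcores:
            self.allocated = Resource(new_mem, new_cpu)
            return True
        return False

    def release(self, grant: Resource) -> None:
        """Return a finished container's resources to the pool."""
        self.allocated = Resource(self.allocated.memory_mb - grant.memory_mb,
                                  self.allocated.vcores - grant.vcores)

sched = PureScheduler(Resource(memory_mb=8192, vcores=4))
print(sched.allocate(Resource(4096, 2)))  # True
print(sched.allocate(Resource(8192, 4)))  # False: exceeds remaining capacity
```

Note that a refused request simply fails; retrying, and recovering from failed containers, is left entirely to the application side.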
The NodeManager is the per-machine slave responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network), and reporting that usage to the ResourceManager.
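Conceptually, the NodeManager’s reporting duty amounts to aggregating per-container usage into a periodic status message. A minimal sketch, assuming a hypothetical payload shape (this is not the real YARN wire protocol):

```python
# Illustrative sketch: a NodeManager aggregates the usage of the
# containers it hosts and reports the totals to the ResourceManager.
def heartbeat(node_id, containers):
    """containers: mapping of container id -> per-container usage dict."""
    return {
        "node": node_id,
        "containers": sorted(containers),
        "total_memory_mb": sum(c["memory_mb"] for c in containers.values()),
        "total_vcores": sum(c["vcores"] for c in containers.values()),
    }

usage = {
    "container_01": {"memory_mb": 1024, "vcores": 1},
    "container_02": {"memory_mb": 2048, "vcores": 2},
}
report = heartbeat("node-17", usage)
print(report["total_memory_mb"])  # 3072
```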
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status, and monitoring progress. From the system perspective, the ApplicationMaster itself runs as a normal container.
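The negotiation loop an ApplicationMaster runs can be sketched as follows. All names here are hypothetical (a stub scheduler stands in for the ResourceManager’s Scheduler); the structure is what matters: ask for containers, launch the granted ones, and defer the rest to a later round.

```python
class FixedBudgetScheduler:
    """Stub with a memory budget, standing in for the RM's Scheduler."""
    def __init__(self, budget_mb):
        self.budget_mb = budget_mb

    def allocate(self, memory_mb):
        if memory_mb <= self.budget_mb:
            self.budget_mb -= memory_mb
            return True
        return False

def run_round(scheduler, tasks):
    """One negotiation round of a hypothetical ApplicationMaster:
    tasks granted a container are 'launched'; the rest are deferred."""
    launched, deferred = [], []
    for name, mem in tasks:
        if scheduler.allocate(mem):       # negotiate with the Scheduler
            launched.append(name)         # would hand off to a NodeManager here
        else:
            deferred.append((name, mem))  # ask again in the next round
    return launched, deferred

sched = FixedBudgetScheduler(budget_mb=4096)
tasks = [("map_0", 2048), ("map_1", 2048), ("reduce_0", 2048)]
launched, deferred = run_round(sched, tasks)
print(launched)   # ['map_0', 'map_1']
print(deferred)   # [('reduce_0', 2048)]
```

In the real system the AM also tracks each launched container and re-requests replacements for failed ones, since the Scheduler deliberately offers no such guarantees.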
Here is an architectural view of YARN:
One of the crucial implementation details for MapReduce within the new YARN system that I’d like to point out is that we have reused the existing MapReduce framework without any major surgery. This was very important to ensure compatibility for existing MapReduce applications and users.
