Apache Hadoop
Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data.
The base Apache Hadoop framework is composed of the following modules:
- Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
- Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
- Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them to schedule users' applications.
- Hadoop MapReduce – a programming model for large-scale data processing.
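The MapReduce programming model can be illustrated with a minimal, self-contained sketch of the classic word count. This is plain Python standing in for the Hadoop framework; the function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for each word.
    return key, sum(values)

lines = ["hadoop stores data", "hadoop processes data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In real Hadoop, each map task runs on the node holding its input split and the framework handles the shuffle across the network; the sketch only shows the data flow.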
Why Hadoop?
Let us now understand why Hadoop is so popular and why it has come to dominate the big data market.
Hadoop is not only a storage system but a platform for both data storage and processing. It is scalable (more nodes can be added on the fly), fault tolerant (even if a node goes down, its data can be processed by another node), and open source (the source code can be modified if required).
The following characteristics of Hadoop make it a unique platform:
- Flexibility to store and mine any type of data, whether structured, semi-structured, or unstructured. It is not bound by a single schema.
- Excels at processing data of a complex nature; its scale-out architecture divides workloads across multiple nodes. Another advantage is that its flexible file system eliminates ETL bottlenecks.
- Scales economically: as discussed, it can be deployed on commodity hardware. Apart from this, its open-source nature guards against vendor lock-in.
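The fault tolerance mentioned above comes from replication: HDFS stores each block on several nodes (three by default), so losing one node does not lose data. A toy sketch of the idea, with illustrative names rather than Hadoop code:

```python
REPLICATION_FACTOR = 3  # HDFS default replication factor

def replicate(block_id, nodes, factor=REPLICATION_FACTOR):
    # Place copies of a block on `factor` distinct nodes.
    # (Real HDFS also considers rack topology; this sketch does not.)
    return nodes[:factor]

def read_block(block_id, replicas, alive):
    # If one node goes down, the block is still readable
    # from another replica.
    for node in replicas:
        if node in alive:
            return f"{block_id} read from {node}"
    raise IOError(f"all replicas of {block_id} lost")

replicas = replicate("blk_1", ["n1", "n2", "n3", "n4"])
# Node n1 is down, yet the read still succeeds via n2:
print(read_block("blk_1", replicas, alive={"n2", "n3"}))
```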
Hadoop Architecture
After understanding what Hadoop is, let us now look at its architecture.
Hadoop works in a master–slave fashion. There is one master node and n slave nodes, where n can run into the thousands. The master manages, maintains, and monitors the slaves, while the slaves are the actual worker nodes. The master should be deployed on good-configuration hardware, not just any commodity hardware, as it is the centerpiece of the Hadoop cluster.
The master stores only the metadata (data about data), while the slaves are the nodes that store the actual data, which is distributed across the cluster. A client connects to the master node to perform any task.
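The split between metadata on the master and data on the slaves can be sketched as a toy model in Python. The names `metadata`, `block_store`, and `read_file` are illustrative, not Hadoop APIs:

```python
# Master (NameNode) keeps only metadata: which blocks make up a file
# and which slave node holds each block.
metadata = {
    "sales.log": [("blk_1", "slave-3"), ("blk_2", "slave-7")],
}

# Slaves (DataNodes) hold the actual block contents.
block_store = {
    "slave-3": {"blk_1": b"first part of sales.log..."},
    "slave-7": {"blk_2": b"rest of sales.log"},
}

def read_file(name):
    # A client asks the master where the blocks are, then reads
    # each block directly from the slave that stores it.
    return b"".join(block_store[node][blk] for blk, node in metadata[name])

data = read_file("sales.log")
```

Note that the file bytes never pass through the master; it only answers the "where is block X?" question, which is why it can manage thousands of slaves.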
Hadoop Daemons
Hadoop mainly runs four daemons. Daemons are processes that run in the background.
- NameNode – runs on the master node for HDFS.
- DataNode – runs on slave nodes for HDFS.
- ResourceManager – runs on the master node for YARN.
- NodeManager – runs on slave nodes for YARN.
These four daemons must run for Hadoop to be functional. Apart from these, there can be a secondary NameNode, a standby NameNode, a Job HistoryServer, etc.
How Does Hadoop Work?
So far we have studied the Hadoop introduction and Hadoop architecture in detail. Let us now summarize how Hadoop works, step by step:
- Step 1: Input data is broken into blocks of 128 MB (by default), and the blocks are then moved to different nodes.
- Step 2: Once all the blocks of the file are stored on DataNodes, a user can process the data.
- Step 3: The master then schedules the program (submitted by the user) on individual nodes.
- Step 4: Once all the nodes have processed the data, the output is written back to HDFS.
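Step 1 above can be sketched in a few lines of Python. A tiny 5-byte block size stands in for the 128 MB default, and the round-robin placement is a simplification of HDFS's real placement policy:

```python
BLOCK_SIZE = 5  # stands in for Hadoop's 128 MB default block size

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Break the input into fixed-size blocks; the last one may be smaller.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes):
    # Spread the blocks across the slave nodes (round-robin here;
    # real HDFS placement is rack-aware and replicated).
    return {i: nodes[i % len(nodes)] for i in range(len(blocks))}

data = b"abcdefghijkl"  # 12 bytes -> 3 blocks of at most 5 bytes
blocks = split_into_blocks(data)
placement = place_blocks(blocks, ["slave-1", "slave-2"])
print(blocks)      # [b'abcde', b'fghij', b'kl']
print(placement)   # {0: 'slave-1', 1: 'slave-2', 2: 'slave-1'}
```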
Hadoop Flavors
Below are the various flavors of Hadoop:
- Apache – The vanilla flavor; the actual code resides in the Apache repositories.
- Hortonworks – A popular distribution in the industry.
- Cloudera – The most popular distribution in the industry.
- MapR – Has rewritten HDFS, and its file system is faster than the others.
- IBM – IBM's proprietary distribution, known as BigInsights.
Most databases provide native connectivity with Hadoop for fast data transfer; for example, to transfer data from Oracle to Hadoop, you need a connector.
All flavors are almost the same, and if you know one, you can easily work with the other flavors as well.
Hadoop Ecosystem
In this section of the Hadoop introduction tutorial, we will cover the Hadoop ecosystem. Let us see which components form the Hadoop ecosystem:
- Hadoop HDFS: The distributed storage layer for Hadoop.
- YARN: The resource management layer, introduced in Hadoop 2.x.
- Hadoop MapReduce: The parallel processing layer for Hadoop.
- HBase: A column-oriented database that runs on top of HDFS. It is a NoSQL database that does not understand structured queries. It is well suited for sparse data sets.
- Hive: A data warehousing infrastructure based on Hadoop that enables easy data summarization using SQL-like queries.
- Sqoop: A tool designed to transport huge volumes of data between Hadoop and RDBMS.
- Flume: A reliable system for efficiently collecting large amounts of log data from many different sources in real time.
- Oozie: A Java Web application used to schedule Apache Hadoop jobs. It combines multiple jobs sequentially into one logical unit of work.
- Pig: A high-level scripting language used with Hadoop. Pig enables writing complex data processing without Java programming.
Each project has been developed to deliver an explicit function, and each has its own community of developers and individual release cycles. There are five pillars to Hadoop that make it enterprise-ready:
Data Management – Store and process vast quantities of data in a storage layer that scales linearly. Hadoop Distributed File System (HDFS) is the core technology for the efficient scale-out storage layer, and is designed to run across low-cost commodity hardware. Apache Hadoop YARN is the prerequisite for Enterprise Hadoop, as it provides the resource management and pluggable architecture for enabling a wide variety of data access methods to operate on data stored in Hadoop with predictable performance and service levels.
- Apache Hadoop YARN – Part of the core Hadoop project, YARN is a next-generation framework for Hadoop data processing, extending MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.
- HDFS – Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers.
Data Access – Interact with your data in a wide variety of ways, from batch to real-time. Apache Hive is the most widely adopted data access technology, though there are many specialized engines. For instance, Apache Pig provides scripting capabilities, Apache Storm offers real-time processing, Apache HBase offers columnar NoSQL storage, and Apache Accumulo offers cell-level access control.
- Apache Hive – Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS.
- Apache Pig – A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for processing these programs.
- MapReduce – A framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner.
- Apache Spark – Spark is ideal for in-memory data processing. It allows data scientists to implement fast, iterative algorithms for advanced analytics such as clustering and classification of datasets.
- Apache Storm – A distributed real-time computation system for processing fast, large streams of data, adding reliable real-time data processing capabilities to Apache Hadoop 2.x.
- Apache HBase – A column-oriented NoSQL data storage system that provides random, real-time read/write access to big data for user applications.
- Apache Tez – Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near real-time big data processing.
- Apache Kafka – A fast and scalable publish-subscribe messaging system that is often used in place of traditional message brokers because of its higher throughput, replication, and fault tolerance.
- Apache HCatalog – A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
- Apache Slider – A framework for deploying long-running data access applications in Hadoop. Slider leverages YARN's resource management capabilities to deploy those applications, manage their lifecycles, and scale them up or down.
- Apache Solr – The open source platform for searching data stored in Hadoop. Solr enables powerful full-text search and near real-time indexing on many of the world's largest Internet sites.
- Apache Mahout – Mahout provides scalable machine learning algorithms for Hadoop, aiding data science tasks such as clustering, classification, and batch-based collaborative filtering.
- Apache Accumulo – A high-performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google's Bigtable design that works on top of Apache Hadoop and Apache ZooKeeper.
Data Governance and Integration – Quickly and easily load data, and manage it according to policy. Workflow Manager provides workflows for data governance, while Apache Flume and Sqoop enable easy data ingestion, as do the NFS and WebHDFS interfaces to HDFS.
- Workflow Management – Workflow Manager allows you to easily create and schedule workflows and monitor workflow jobs. It is based on the Apache Oozie workflow engine, which allows users to connect and automate the execution of big data processing tasks into a defined workflow.
- Apache Flume – Flume allows you to efficiently aggregate and move large amounts of log data from many different sources to Hadoop.
- Apache Sqoop – Sqoop is a tool that speeds and eases the movement of data in and out of Hadoop. It provides a reliable parallel load for various popular enterprise data sources.
Security – Address requirements of authentication, authorization, accounting, and data protection. Security is provided at every layer of the Hadoop stack, from HDFS and YARN to Hive and the other data access components, on up through the entire perimeter of the cluster via Apache Knox.
- Apache Knox – The Knox Gateway ("Knox") provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access to the cluster.
- Apache Ranger – Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides central security policy administration across the core enterprise security requirements of authorization, accounting, and data protection.
Operations – Provision, manage, monitor, and operate Hadoop clusters at scale.
- Apache Ambari – An open source installation lifecycle management, administration, and monitoring system for Apache Hadoop clusters.
- Apache Oozie – A Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
- Apache ZooKeeper – A highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.