Apache HDFS
HADOOP DISTRIBUTED FILE SYSTEM
HDFS is a distributed file system designed for storing large data files. It is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4,500 servers, supporting close to a billion files and blocks. It is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data-access applications, coordinated by YARN, and it will "just work" under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every level of storage.
An HDFS cluster consists of a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes such as permissions, modification and access times, and namespace and disk space quotas.
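To make the inode attributes above concrete, here is a minimal, hypothetical Python model of what a NameNode inode records. The field names are illustrative only; the real NameNode implements inodes in Java with many more fields.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, simplified model of the attributes an HDFS inode records.
@dataclass
class Inode:
    name: str
    permissions: str               # e.g. "rwxr-xr-x"
    modification_time: int         # epoch milliseconds
    access_time: int               # epoch milliseconds
    namespace_quota: Optional[int] = None  # max number of files and directories
    diskspace_quota: Optional[int] = None  # max bytes, counting all replicas

# A directory inode with a namespace quota of one million names.
root = Inode("/", "rwxr-xr-x", 0, 0, namespace_quota=1_000_000)
```

Directories and files share the same attribute set; quotas apply only to directories.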
The file content is split into large blocks (typically 128 MB), and each block of the file is independently replicated on multiple DataNodes. The blocks are stored on the local file system of the DataNodes.
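The block-splitting arithmetic can be sketched as follows; this is a simulation of the sizing logic, not HDFS's actual implementation, and the 128 MiB block size and replication factor of 3 are the common defaults, assumed here.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size: 128 MiB (assumption)
REPLICATION = 3                  # default replication factor (assumption)

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the sizes of the blocks a file of file_size bytes occupies.

    Every block is full-size except possibly the last; HDFS does not
    pad the final block out to block_size.
    """
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

# A 300 MiB file splits into blocks of 128, 128, and 44 MiB,
# and each block is stored on REPLICATION different DataNodes.
sizes = split_into_blocks(300 * 1024 * 1024)
```

Because the last block is not padded, a small file consumes only as much raw storage as its content (times the replication factor), though it still costs one block's worth of NameNode metadata.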
The NameNode actively monitors the number of replicas of each block. When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM.
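The re-replication behavior above can be sketched in a few lines. This is a hypothetical simulation of the bookkeeping, with a deliberately trivial placement policy; the real NameNode's placement logic is rack-aware and far more involved.

```python
TARGET_REPLICAS = 3  # assumed replication target

# NameNode-side map of block ID -> set of DataNodes holding a replica.
block_locations = {
    "blk_001": {"dn1", "dn2", "dn3"},
    "blk_002": {"dn2", "dn3", "dn4"},
}

def handle_datanode_failure(failed: str, live_nodes: set[str]) -> dict[str, str]:
    """Drop the failed node and return re-replication work: block -> new target."""
    work = {}
    for block, holders in block_locations.items():
        holders.discard(failed)
        if len(holders) < TARGET_REPLICAS:
            candidates = live_nodes - holders
            if candidates:
                target = sorted(candidates)[0]  # trivial placement policy
                work[block] = target
                holders.add(target)
    return work

# Losing dn2 under-replicates both blocks; each gets one new replica scheduled.
work = handle_datanode_failure("dn2", {"dn1", "dn3", "dn4", "dn5"})
```

The key point the sketch captures is that recovery is driven entirely by the NameNode's replica count per block, not by any state on the failed node.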
The NameNode does not directly send requests to DataNodes. Instead, it sends instructions to the DataNodes by replying to heartbeats sent by those DataNodes. The instructions include commands to:
· replicate blocks to other nodes,
· remove local block replicas,
· re-register and send an immediate block report, or
· shut down the node.
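The heartbeat-reply protocol above can be sketched as a simple command queue: the NameNode never opens a connection to a DataNode, it only piggybacks pending commands on its reply to the next heartbeat. This is a hypothetical simulation; the names and command dictionaries are illustrative, not the real RPC types.

```python
from collections import defaultdict

# NameNode side: commands queued per DataNode, delivered on the next heartbeat.
pending_commands: dict[str, list[dict]] = defaultdict(list)

def enqueue(datanode: str, command: dict) -> None:
    """Queue a command (replicate, remove replica, re-register, shut down)."""
    pending_commands[datanode].append(command)

def heartbeat(datanode: str) -> list[dict]:
    """Handle a DataNode heartbeat; the reply carries all pending commands."""
    commands, pending_commands[datanode] = pending_commands[datanode], []
    return commands

enqueue("dn7", {"op": "replicate", "block": "blk_042", "target": "dn9"})
enqueue("dn7", {"op": "block_report"})
reply = heartbeat("dn7")   # both commands delivered in this reply
later = heartbeat("dn7")   # queue is now empty
```

Because all control flow rides on heartbeats, a DataNode that stops heartbeating receives no further instructions and is eventually declared dead by the NameNode.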

