HDFS

HDFS (Hadoop Distributed File System) is the storage layer of Apache Hadoop. It partitions a large dataset into smaller, manageable blocks and distributes those blocks across the nodes of a cluster, managing the placement, replication, and retrieval transparently.

A block is typically 128 megabytes in modern HDFS deployments (configurable). For a 10-gigabyte file, HDFS chops it into roughly 80 blocks and spreads them across the cluster. Every block is replicated, by default 3 copies on 3 different machines, so failure of any one node doesn’t lose data. If a node fails for long enough, HDFS notices and re-replicates the affected blocks to other healthy nodes, restoring the desired replication factor without human intervention. This is what fault tolerance and automatic recovery mean in practice.

HDFS follows a write-once, read-many model: a file is written once, then read many times. Appending is supported in modern versions but random in-place writes are not. The design assumes large analytics datasets that don’t get edited after they land, which is why you get bitten trying to update an HDFS file the way you would a local one.

The architecture has two roles:

The NameNode is the bookkeeper. It maintains the metadata that describes which blocks live on which DataNodes and how the blocks fit together to form complete files.
The DataNodes are where the bytes actually live. Each DataNode stores blocks on its local disk and reads or writes them when asked.

When a client reads /user/hadoop/input.txt, it asks the NameNode where the blocks live; the NameNode answers (block 1 is on nodes 1, 3, 7; block 2 is on nodes 1, 2, 5); the client then talks directly to the DataNodes to read the bytes.

Working with HDFS from the command line uses the hdfs dfs tool:

hdfs dfs -mkdir -p /user/hduster
hdfs dfs -put my_dataset.csv /user/hduster
hdfs dfs -get /user/hduster/my_dataset.csv /Users/<local-path>/Documents

These three operations (make a directory, upload a file, download a file) cover most of what’s needed from the shell. For programmatic access from Python, the hdfs library connects to the HDFS REST interface:

from hdfs import InsecureClient
client = InsecureClient('http://localhost:9870/', user='hduster')
with client.read('/user/hduster/file.csv', encoding='utf-8') as r:
    content = r.read()

HDFS is scalable (petabytes across thousands of nodes) and reliable (continues to serve files when hardware fails). Those two properties are what made Hadoop dominate big data. It pairs with MapReduce for distributed computation and YARN for resource management to form the three core layers of Apache Hadoop.

Idriss Rami — Notes

Explorer

HDFS

Graph View

Backlinks