A DataNode is a worker node in an HDFS cluster: one of the many machines that actually store the bytes. Each DataNode stores HDFS blocks on its local disk and reads or writes them when a client asks. The cluster’s raw storage capacity is essentially the sum of its DataNodes’ disks; because every block is stored once per replica, the usable capacity is roughly that sum divided by the replication factor.
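
As a rough illustration of that capacity arithmetic, the sketch below divides the total raw disk space by the replication factor. The disk sizes are invented for the example, and the function name is hypothetical. The estimate ignores space reserved for non-HDFS use and other overhead.

```python
def usable_capacity_tb(disk_sizes_tb, replication_factor):
    """Estimate how much unique data the cluster can hold, in TB."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    raw_tb = sum(disk_sizes_tb)          # raw capacity: all DataNode disks combined
    return raw_tb / replication_factor   # each block is stored once per replica


# Hypothetical four-node cluster with 4 TB of disk per DataNode.
disks_tb = [4, 4, 4, 4]
print(usable_capacity_tb(disks_tb, 2))   # 16 TB raw / 2 copies = 8.0
```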

With a replication factor of 2 and a four-node cluster, a typical block distribution might look like:

  • Node 1: blocks 1, 2, 4, 6
  • Node 2: blocks 2, 3, 5
  • Node 3: blocks 1, 5
  • Node 4: blocks 3, 4, 6

Every block exists on two different nodes, so the loss of any one node doesn’t lose data. The NameNode keeps track of which blocks live where and directs clients reading a file to the DataNodes that hold its blocks.
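
The claim about the example distribution can be checked mechanically. The sketch below copies the placement from the list above, counts the replicas of each block, and works out which blocks would have no surviving copy if a given set of nodes failed. The function names are made up for this illustration and are not part of any Hadoop API.

```python
from collections import defaultdict

# Block placement copied from the four-node example.
placement = {
    "Node 1": {1, 2, 4, 6},
    "Node 2": {2, 3, 5},
    "Node 3": {1, 5},
    "Node 4": {3, 4, 6},
}


def replica_counts(placement):
    """Return {block: number of nodes holding a copy}."""
    counts = defaultdict(int)
    for blocks in placement.values():
        for block in blocks:
            counts[block] += 1
    return dict(counts)


def blocks_lost(placement, failed_nodes):
    """Return the blocks with no copy left once failed_nodes are gone."""
    all_blocks = set().union(*placement.values())
    surviving = set()
    for node, blocks in placement.items():
        if node not in failed_nodes:
            surviving |= blocks
    return all_blocks - surviving


print(replica_counts(placement))
for node in placement:
    print(node, "fails ->", blocks_lost(placement, {node}))
```

Every single-node failure leaves all six blocks readable. Two simultaneous failures are a different matter: losing Node 1 and Node 4 together would take both copies of blocks 4 and 6.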

DataNodes report their health to the NameNode by sending periodic heartbeats (every 3 seconds by default). When a DataNode goes silent for long enough (about ten and a half minutes with the default settings), the NameNode considers it lost and starts re-replicating its blocks to other healthy DataNodes, restoring the desired replication factor automatically.
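
The detection step can be sketched in a few lines. This is a heavily simplified model, not the NameNode's actual implementation. It assumes the default values of dfs.heartbeat.interval (3 seconds) and dfs.namenode.heartbeat.recheck-interval (5 minutes) and the usual timeout of twice the recheck interval plus ten heartbeat intervals. The timestamps and function names are invented for the example.

```python
# Defaults as documented in hdfs-default.xml; verify against your Hadoop version.
HEARTBEAT_INTERVAL_S = 3    # dfs.heartbeat.interval
RECHECK_INTERVAL_S = 300    # dfs.namenode.heartbeat.recheck-interval (5 minutes)


def dead_node_timeout_s(heartbeat_s=HEARTBEAT_INTERVAL_S, recheck_s=RECHECK_INTERVAL_S):
    """Silence, in seconds, after which a DataNode is treated as dead."""
    return 2 * recheck_s + 10 * heartbeat_s


def find_dead_nodes(last_heartbeat, now, timeout_s):
    """Return the nodes whose last heartbeat is older than timeout_s."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout_s}


def under_replicated_blocks(placement, dead_nodes, target_replicas):
    """Return {block: live copies} for blocks below the target replica count."""
    live_copies = {}
    for node, blocks in placement.items():
        for block in blocks:
            live_copies.setdefault(block, 0)
            if node not in dead_nodes:
                live_copies[block] += 1
    return {b: n for b, n in live_copies.items() if n < target_replicas}


# Same four-node placement as the example distribution.
placement = {
    "Node 1": {1, 2, 4, 6},
    "Node 2": {2, 3, 5},
    "Node 3": {1, 5},
    "Node 4": {3, 4, 6},
}

# Hypothetical timestamps in seconds: Node 3 was last heard from 800 s ago.
last_heartbeat = {"Node 1": 998, "Node 2": 999, "Node 3": 200, "Node 4": 1000}
now = 1000

dead = find_dead_nodes(last_heartbeat, now, dead_node_timeout_s())
print(dead)
print(under_replicated_blocks(placement, dead, target_replicas=2))
```

In this scenario Node 3 is the only node past the timeout, so blocks 1 and 5 drop to a single live copy each and become candidates for re-replication.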

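DataNodes accumulate state on local disk between sessions: block files and internal metadata, including the ID of the cluster they belong to. If you’ve run Hadoop before on the same machine and want to start fresh, the DataNode directory needs cleaning. Otherwise, once the NameNode is reformatted it gets a new cluster ID, and a DataNode still holding the old one will refuse to start: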

rm -rf /Users/<your-computer-name>/hdfs/datanode/*

The rm -rf is recursive and forced; it deletes everything in the directory without prompting. This is the kind of command you only run when you mean it — it’s irreversible, and running it on a working cluster destroys data.
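
For a more cautious cleanup, one option is a small script that first lists what it would delete and removes anything only when told to. The sketch below is one way to do that, not part of Hadoop. The function name is hypothetical, and the demonstration runs in a throwaway temporary directory with placeholder file names rather than a real DataNode path.

```python
import shutil
import tempfile
from pathlib import Path


def clear_directory(directory, dry_run=True):
    """Delete everything inside `directory` but keep the directory itself.

    Returns the entries that were (or, in a dry run, would be) removed.
    """
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    entries = sorted(root.iterdir())
    if not dry_run:
        for entry in entries:
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)   # recursive delete, like rm -r
            else:
                entry.unlink()         # plain file or symlink
    return entries


# Demonstration in a temp directory; the names below are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "current").mkdir()
    (Path(tmp) / "current" / "VERSION").write_text("placeholder")
    (Path(tmp) / "in_use.lock").write_text("")

    print([p.name for p in clear_directory(tmp)])   # dry run: nothing is deleted
    clear_directory(tmp, dry_run=False)              # actually delete
    print(list(Path(tmp).iterdir()))                 # directory is now empty
```

One difference from the shell command: the `*` glob skips hidden entries by default, whereas `iterdir()` returns them too, so this sketch also removes dotfiles.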

After cleanup, the NameNode is formatted (hadoop namenode -format; newer releases prefer the equivalent hdfs namenode -format), all Hadoop daemons are started (start-all.sh, or start-dfs.sh followed by start-yarn.sh), and the cluster is back to a clean state. jps lists the running Java processes, which should include NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager (the last two belong to YARN). If any are missing, the setup has a problem.
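
The jps check can be automated. The sketch below parses jps-style output, where each line is a process ID followed by a class name, and reports which of the expected daemons are absent. The sample output and its process IDs are made up for illustration, and the function name is hypothetical.

```python
EXPECTED_DAEMONS = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "ResourceManager",
    "NodeManager",
}


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:        # skip blank lines and lines with only a PID
            running.add(parts[1])
    return set(expected) - running


# Invented sample: SecondaryNameNode did not start.
sample = """\
4821 NameNode
4903 DataNode
5127 ResourceManager
5210 NodeManager
5342 Jps
"""
print(missing_daemons(sample))
```

To run it against a live machine, the output of `subprocess.run(["jps"], capture_output=True, text=True).stdout` could be passed in, assuming the JDK's jps tool is on the PATH.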