A DataNode is a worker node in an HDFS cluster, one of the many machines that actually store the bytes. Each DataNode stores HDFS blocks on its local disk and reads or writes them when a client asks. The cluster’s raw storage capacity is the sum of its DataNodes’ disks; usable capacity is roughly that total divided by the replication factor, because every block is stored that many times.
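The capacity arithmetic can be sketched in a few lines. This is a rough estimate only: the function name and the 4 TB disk sizes are made up for illustration, and the calculation ignores space reserved for non-HDFS use and other overhead.

```python
def usable_capacity_tb(disk_sizes_tb, replication_factor):
    """Approximate usable HDFS capacity: raw capacity / replication factor."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    raw_tb = sum(disk_sizes_tb)          # raw capacity across all DataNodes
    return raw_tb / replication_factor   # each block is stored this many times


# Hypothetical cluster: four DataNodes with 4 TB each, replication factor 2.
print(usable_capacity_tb([4, 4, 4, 4], 2))  # 8.0
```

Raising the replication factor buys more fault tolerance at the cost of usable space: the same hypothetical cluster at a factor of 4 would hold only about 4 TB of distinct data.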

With a replication factor of 2 and a four-node cluster, a typical block distribution might look like:

  • Node 1: blocks 1, 2, 4, 6
  • Node 2: blocks 2, 3, 5
  • Node 3: blocks 1, 5
  • Node 4: blocks 3, 4, 6

In this layout every block exists on exactly two nodes, so the loss of any single node doesn’t lose data. The NameNode keeps track of which blocks live on which DataNodes. A client reading a file asks the NameNode for the block locations and then reads the blocks directly from those DataNodes.
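The four-node layout above can be checked mechanically. The sketch below copies the placement from the list and tests, for each node in turn, whether taking it down would leave any block without a replica. The function names are illustrative and not part of any Hadoop API.

```python
from collections import defaultdict

# Block placement copied from the four-node example above.
placement = {
    "node1": {1, 2, 4, 6},
    "node2": {2, 3, 5},
    "node3": {1, 5},
    "node4": {3, 4, 6},
}


def replicas_per_block(placement):
    """Map each block ID to the set of nodes holding a replica of it."""
    holders = defaultdict(set)
    for node, blocks in placement.items():
        for block in blocks:
            holders[block].add(node)
    return dict(holders)


def blocks_lost_if_down(placement, failed_node):
    """Return the blocks whose only replicas were on the failed node."""
    holders = replicas_per_block(placement)
    return sorted(b for b, nodes in holders.items() if nodes <= {failed_node})


for node in placement:
    print(node, "down -> lost blocks:", blocks_lost_if_down(placement, node))
```

Every node reports an empty list, which confirms the single-failure guarantee. A replication factor of 2 does not protect against two simultaneous failures, though: in this layout blocks 4 and 6 live only on nodes 1 and 4, so losing both of those nodes would lose both blocks.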

DataNodes report their health to the NameNode by sending periodic heartbeats. When a DataNode stays silent for long enough (roughly ten minutes with default settings), the NameNode considers it lost and starts re-replicating its blocks to other healthy DataNodes, restoring the desired replication factor automatically.
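How long "long enough" is depends on configuration. As a sketch, assuming stock Apache Hadoop, the timeout is derived from two hdfs-site.xml properties, dfs.heartbeat.interval (default 3 seconds) and dfs.namenode.heartbeat.recheck-interval (default 300000 milliseconds), as twice the recheck interval plus ten heartbeat intervals. Check the hdfs-default.xml of your Hadoop version before relying on these numbers. The function names below are illustrative.

```python
def dead_node_timeout_seconds(heartbeat_interval_s=3, recheck_interval_ms=300_000):
    """Seconds of silence before a DataNode is treated as dead.

    Assumed formula: 2 * recheck interval + 10 * heartbeat interval,
    with defaults taken from stock hdfs-default.xml.
    """
    return 2 * (recheck_interval_ms / 1000) + 10 * heartbeat_interval_s


def is_dead(last_heartbeat_s, now_s, timeout_s):
    """True if the last heartbeat is older than the timeout."""
    return now_s - last_heartbeat_s > timeout_s


timeout = dead_node_timeout_seconds()
print(timeout)                    # 630.0 seconds, i.e. 10.5 minutes
print(is_dead(0, 600, timeout))   # False: still within the timeout
print(is_dead(0, 631, timeout))   # True: silent for too long
```

The long default is deliberate: re-replication moves a lot of data, so the NameNode waits well beyond a few missed heartbeats before deciding that a node is really gone.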

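DataNodes accumulate state on local disk between sessions: block files and internal metadata. If you’ve run Hadoop before on the same machine and want to start fresh, the DataNode directory (the path configured as dfs.datanode.data.dir in hdfs-site.xml) needs cleaning. Otherwise the stale metadata can conflict with a freshly formatted NameNode, and the DataNode may refuse to start: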

rm -rf /Users/<your-username>/hdfs/datanode/*

rm -rf is recursive and forced; it deletes everything in the directory without prompting. It’s irreversible, so only run it on a node whose data you mean to discard, and stop the Hadoop daemons first. Running it on a working cluster destroys data.
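Because the deletion cannot be undone, it can help to preview what would be removed first. The following is a minimal sketch of a dry-run wrapper, not a Hadoop tool. The function name is hypothetical, and the demo runs against a throwaway temporary directory with a stand-in file rather than a real DataNode directory.

```python
import shutil
import tempfile
from pathlib import Path


def clear_directory(directory, dry_run=True):
    """Delete everything inside `directory`, keeping the directory itself.

    Similar to `rm -rf <directory>/*`, except that dotfiles are removed too
    (the shell glob `*` skips them). Returns the list of affected paths.
    With dry_run=True nothing is deleted.
    """
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    entries = sorted(root.iterdir())
    if not dry_run:
        for entry in entries:
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)   # real subdirectory: remove recursively
            else:
                entry.unlink()         # regular file or symlink
    return entries


# Demo on a temporary directory with stand-in contents.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "current").mkdir()
    (Path(tmp) / "current" / "VERSION").write_text("stand-in metadata\n")
    print(clear_directory(tmp))            # dry run: lists paths, deletes nothing
    clear_directory(tmp, dry_run=False)    # actually delete
    print(list(Path(tmp).iterdir()))       # []
```

Defaulting to dry_run=True means a mistyped path costs nothing until the caller explicitly opts in to deletion.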

After cleanup, format the NameNode (hdfs namenode -format; the older hadoop namenode -format form is deprecated), start all Hadoop daemons (start-all.sh), and the cluster is back to a clean state. jps lists the running Java processes, which should include NameNode, DataNode, SecondaryNameNode, ResourceManager (for YARN), and NodeManager. If any are missing, that daemon failed to start, and its log file usually explains why.
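The jps check is easy to automate. The sketch below assumes each line of jps output has the form "<pid> <ClassName>". The sample text is invented for illustration (the PIDs are made up) and deliberately omits NodeManager to show a failing case.

```python
EXPECTED_DAEMONS = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "ResourceManager",
    "NodeManager",
}


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:          # skip blank lines and bare PIDs
            running.add(parts[1])
    return sorted(set(expected) - running)


# Invented sample output; NodeManager is intentionally absent.
sample = """\
4821 NameNode
4935 DataNode
5120 SecondaryNameNode
5342 ResourceManager
5601 Jps
"""
print(missing_daemons(sample))  # ['NodeManager']
```

On a real machine the input could come from subprocess.run(["jps"], capture_output=True, text=True).stdout, provided jps is on the PATH. An empty result means all five daemons are up.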