A node in a distributed system is an individual machine that participates in the cluster. It can be a physical machine sitting in a rack — its own CPU, RAM, disks, network card — or a virtual instance running in the cloud. From outside the cluster the distinction usually doesn’t matter; what matters is that the node has its own compute and storage resources, and it communicates with other nodes over a network.

When we connect a group of nodes together so they cooperate as if they were one machine, we call the group a cluster. The cluster’s raw storage capacity is roughly the sum of its nodes’ disks, and its usable capacity is roughly that sum divided by the replication factor. Its computational throughput is roughly the sum of its nodes’ CPUs, minus whatever is lost to coordination overhead.
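The capacity arithmetic can be sketched in a few lines of Python. The function name `usable_storage_tb` and the node sizes are made up for illustration, and the calculation deliberately ignores filesystem overhead and reserved space.

```python
def usable_storage_tb(node_disk_tb, replication_factor):
    """Estimate usable capacity: raw capacity divided by the replication factor.

    node_disk_tb is a list with one entry per node (disk size in TB).
    This is a back-of-the-envelope estimate, not an exact accounting.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    raw_tb = sum(node_disk_tb)  # raw capacity is the sum of all disks
    return raw_tb / replication_factor


# Hypothetical cluster: ten nodes with 8 TB of disk each.
nodes = [8.0] * 10
print("raw TB:   ", sum(nodes))
print("usable TB:", round(usable_storage_tb(nodes, replication_factor=3), 1))
```

With three copies of every block, only about a third of the raw disk space holds distinct data.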

In a Hadoop cluster running HDFS, nodes split into two roles:

  • The NameNode is the bookkeeper — one (or a few, for high availability) per cluster — that maintains the metadata mapping file paths to block locations.
  • The DataNodes are the workers — many per cluster — that actually store and serve the blocks.
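The division of labour between the two roles can be illustrated with a toy in-memory model. This is only a sketch: the class names mirror the roles, but every method name, the block-id format, and the round-robin placement are assumptions made for the example, not how HDFS is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Worker: stores block contents keyed by block id."""
    name: str
    blocks: dict = field(default_factory=dict)

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


@dataclass
class NameNode:
    """Bookkeeper: holds metadata only, never block contents."""
    files: dict = field(default_factory=dict)      # path -> ordered block ids
    locations: dict = field(default_factory=dict)  # block id -> DataNode names

    def add_block(self, path, block_id, datanode_names):
        self.files.setdefault(path, []).append(block_id)
        self.locations[block_id] = list(datanode_names)

    def locate(self, path):
        return [(b, self.locations[b]) for b in self.files[path]]


def write_file(namenode, datanodes, path, chunks, replicas=2):
    """Store each chunk on `replicas` DataNodes and record where it went."""
    names = sorted(datanodes)
    for i, data in enumerate(chunks):
        block_id = f"{path}#blk{i}"
        # Round-robin placement: a simplification, not HDFS's placement policy.
        holders = [names[(i + k) % len(names)] for k in range(replicas)]
        for n in holders:
            datanodes[n].store(block_id, data)
        namenode.add_block(path, block_id, holders)


def read_file(namenode, datanodes, path):
    """Ask the NameNode where the blocks are, then fetch them from DataNodes."""
    return b"".join(
        datanodes[holders[0]].read(block_id)
        for block_id, holders in namenode.locate(path)
    )


datanodes = {n: DataNode(n) for n in ("dn1", "dn2", "dn3")}
namenode = NameNode()
write_file(namenode, datanodes, "/logs/app.txt", [b"hello ", b"world"])
print(read_file(namenode, datanodes, "/logs/app.txt"))
```

Note that the NameNode object never touches the bytes themselves. A reader asks it where the blocks live and then fetches them from the DataNodes directly.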

Other distributed systems have analogous role splits. The recurring theme: a small number of coordinator nodes that track what’s where, and a much larger number of worker nodes that hold the data or run the computation.

At cluster scale, individual node failure is the rule, not the exception. With a thousand nodes, some node is almost always offline, rebooting, or misbehaving. Distributed systems are designed around this: data is replicated (in HDFS, each block is stored on three DataNodes by default, and the replication factor is configurable), failed tasks are retried on other nodes, and monitoring runs continuously to detect failures.
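A quick calculation shows why failure becomes routine as clusters grow. The sketch below assumes that nodes fail independently and that each node is unavailable for the same small fraction of the time. Both are simplifications, and the 0.1% figure is a hypothetical value chosen for illustration.

```python
def prob_some_node_down(num_nodes, per_node_unavailability):
    """P(at least one node is down) = 1 - P(every node is up).

    Assumes nodes fail independently with the same probability.
    """
    return 1.0 - (1.0 - per_node_unavailability) ** num_nodes


# Hypothetical: each node is unavailable 0.1% of the time.
for n in (1, 10, 100, 1000):
    print(n, "nodes:", round(prob_some_node_down(n, 0.001), 3))
```

Under these assumptions a single node is almost never down, but in a thousand-node cluster the chance that at least one node is down at any given moment is roughly 63%.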