A cluster in computing is a group of nodes — physical machines or virtual instances — connected over a network and operated as if they were a single logical system. The cluster is the substrate of distributed systems: storage clusters hold more data than any one machine could, compute clusters do more work than any one CPU could.
Two things a cluster makes possible that a single machine can’t:
- Storage beyond one disk — split the dataset into pieces, put different pieces on different nodes. The cluster appears to hold the whole dataset; no individual node does.
- Computation beyond one CPU — hand pieces of the work to different nodes, run them in parallel, combine the results.
Clusters have failure modes that single machines don’t. With one node, the node either works or it doesn’t, and the binary is the whole story. With a thousand nodes, some node is almost always failing — disks die, network cables get unplugged, machines need rebooting for security patches. A cluster that assumes all its nodes are healthy will spend most of its time broken; a cluster designed for the actual failure pattern (replicate data across multiple nodes, retry failed tasks elsewhere, monitor health continuously) keeps working through the noise.
A cluster running Apache Hadoop consists of NameNodes (metadata bookkeepers) and DataNodes (where the actual data lives), coordinated through YARN for resource management. Other distributed systems have similar role splits — coordinator nodes plus worker nodes — but the names and details vary.
Two scaling strategies sit behind clusters. Vertical scaling (scale up) buys a bigger machine — more cores, more RAM, faster disks — and stays on a single host. Horizontal scaling (scale out) adds more machines, accepting the coordination cost in return for near-unlimited headroom. Clusters are the horizontal-scaling answer, and the design payoff is that they run on commodity hardware — ordinary servers, not specialized supercomputer nodes. Hadoop in particular was built on the premise that a thousand cheap boxes (with the failure rate that implies) can outdo one expensive box, provided the software handles failure gracefully.
Clusters are the substrate for Big data in the same way single machines are the substrate for medium-sized data and files are for small data. Each step up in scale brings new techniques and new failure modes.