Apache Hadoop is open-source software for reliable, scalable, distributed computing. It was started by Doug Cutting and Mike Cafarella around 2005-2006 (largely developed at Yahoo!), inspired by two Google papers: the Google File System paper (2003) and the MapReduce paper (2004). The project’s name comes from the toy elephant of Cutting’s young son — which is also why Hadoop’s logo is a yellow elephant. It became a top-level Apache project in 2008. The core is written in Java, and Hadoop dominated the Big data world for roughly a decade. It remains the substrate beneath many modern data platforms.
The project has grown into a sprawling ecosystem of tools, but three components sit at its heart:
- HDFS (Hadoop Distributed File System) — the storage layer. Partitions large datasets into blocks and distributes them across the nodes of a cluster, with replication for fault tolerance.
- MapReduce — the original programming model for distributed data processing. Express computations as a map phase that processes pieces independently, followed by a reduce phase that combines partial results.
- YARN (Yet Another Resource Negotiator) — the resource manager. Schedules jobs, allocates CPU and memory across the cluster, and tracks which work is happening where.
We’ll look at each lightly. The deeper details — installation, configuration, the dozens of subprojects (Hive, HBase, Pig, Sqoop) — belong to the labs and to operational documentation rather than to a textbook introduction.
The Introduction to Data Science course flags the Big Data section as out-of-syllabus for examination purposes. The ideas are worth understanding because they shape how modern large-scale systems are built, but the operational depth is left for later courses.
Modern systems often use Hadoop primarily for storage (HDFS) and successor computational frameworks — Apache Spark most notably — for processing. The shape is the same: distributed storage plus distributed compute plus resource management. The next generation of frameworks just makes iterative algorithms (gradient descent, expectation-maximization, the modern ML workhorses) much more efficient than MapReduce did.