Apache Hadoop is open-source software for reliable, scalable, distributed computing. Doug Cutting and Mike Cafarella started it around 2005-2006 (much of the development took place at Yahoo!), inspired by two Google papers: the Google File System paper (2003) and the MapReduce paper (2004). The name comes from a toy elephant belonging to Cutting’s young son, which is also why Hadoop’s logo is a yellow elephant. It became a top-level Apache project in 2008. The core is written in Java, and Hadoop dominated big data for roughly a decade. It still sits underneath many modern data platforms.

The ecosystem has grown large, but three components sit at the center:

  • HDFS (Hadoop Distributed File System), the storage layer. Partitions large datasets into blocks and distributes them across the nodes of a cluster, with replication for fault tolerance.
  • MapReduce, the original programming model for distributed data processing. Expresses computations as a map phase that processes pieces independently, followed by a reduce phase that combines partial results.
  • YARN (Yet Another Resource Negotiator), the resource manager. Schedules jobs, allocates CPU and memory across the cluster, and tracks which work is happening where.
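To make the map and reduce phases concrete, here is a minimal single-machine sketch of the classic word-count example in plain Python. It is not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample chunks are made up for illustration, and the grouping step in the middle stands in for the shuffle that the framework performs between the two phases on a real cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: process one chunk independently, emitting a (word, 1) pair per word."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework on a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the partial results for one key into a single count."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk could be mapped on a different node; here they run in sequence.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: three small "blocks" of one larger text file.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
counts = word_count(chunks)
print(counts)
```

Because map_phase looks at one chunk at a time and reduce_phase looks at one key at a time, both phases can in principle be spread across many machines, which is the point of the model.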

The deeper details (installation, configuration, the dozens of subprojects like Hive, HBase, Pig, Sqoop) belong to the labs and to operational documentation, not to an intro.

The Big Data section is outside the examination syllabus. The ideas are still worth understanding because they shape how modern large-scale systems are built, but the operational depth is left for later courses.

Modern systems often keep Hadoop mainly for storage (HDFS) and hand processing to successor frameworks, most notably Apache Spark. The shape is the same: distributed storage plus distributed compute plus resource management. The newer frameworks run iterative algorithms (gradient descent, expectation-maximization, the modern ML workhorses) much more efficiently than MapReduce did, largely because they can keep working data in memory between iterations instead of writing it to disk after every job.
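A rough sketch of why iteration matters, again in plain Python rather than any real framework: gradient descent for a one-parameter model y ≈ w·x, where every iteration is one full map-and-reduce pass over the data. The function name gradient_pass, the toy records, the learning rate, and the iteration count are all illustrative assumptions.

```python
def gradient_pass(records, w):
    """One full pass over the data: a map step followed by a reduce step."""
    # Map: each record independently contributes its squared-error gradient.
    partials = [2 * (w * x - y) * x for x, y in records]
    # Reduce: combine the partial gradients into one average.
    return sum(partials) / len(records)


# Toy dataset generated from y = 2x, so the fitted w should approach 2.
records = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0
learning_rate = 0.05
passes = 0
for _ in range(50):
    w -= learning_rate * gradient_pass(records, w)
    passes += 1  # every iteration touches the entire dataset again

print(w, passes)
```

The loop needs 50 passes over the same records. With classic MapReduce, each pass would typically be a separate job that reads its input from HDFS again, whereas a framework such as Spark can keep the dataset cached in memory across passes.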