A Hadoop cluster built with Docker Compose, bundling Hadoop, Postgres (for the Hive metastore), Jupyter Notebook, Hive, and Spark.
This repository uses Docker Compose to bring up a cluster with the following components:
- Hadoop: Distributed file system (HDFS) with NameNode and DataNode.
- Postgres: Database backend for Hive metastore.
- Jupyter Notebook: Interactive Python environment with PySpark integration.
- Hive: Data warehouse solution for querying large datasets stored in HDFS.
- Spark: Distributed computing framework for big data processing, including Spark Master and Worker nodes.
This setup is ideal for development, testing, and learning purposes. It allows you to experiment with Hadoop, Hive, Spark, and Jupyter in an isolated environment without requiring complex installations.
- Install Docker and Docker Compose.
- Clone this repository and navigate to the project directory:

  ```shell
  git clone https://github.com/your-repo/hadoop-docker-cluster.git
  cd hadoop-docker-cluster
  ```

- Run the startup script:

  ```shell
  ./start_demo.sh
  ```

- To stop the cluster, run:

  ```shell
  ./stop_demo.sh
  ```
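The startup script presumably wraps `docker compose up`; the actual service definitions live in `docker-compose.yml`. For orientation, a minimal illustrative fragment is shown below — the image names are assumptions for the sketch and may differ from what this repository actually uses:

```yaml
# Hypothetical excerpt -- check the real docker-compose.yml in this repo.
services:
  namenode:
    image: bde2020/hadoop-namenode   # assumed image, not confirmed from this repo
    ports:
      - "9870:9870"                  # NameNode web UI
  spark-master:
    image: bde2020/spark-master      # assumed image
    ports:
      - "8080:8080"                  # Spark Master UI
```

Each service maps a container port to the same host port, which is why the UIs below are reachable on `localhost`.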
Once the cluster is up and running, you can access the following services from your host machine (all except HiveServer2, a Thrift endpoint, are browsable):
- NameNode UI: http://localhost:9870/dfshealth.html#tab-overview (overview of HDFS and its health status)
- DataNode UI: http://localhost:50020/ (status of the DataNode)
- Spark Master UI: http://localhost:8080/ (monitor Spark jobs and worker nodes)
- HiveServer2: http://localhost:10000/ (Thrift server for Hive queries)
- Jupyter Notebook: http://localhost:8888/ (interactive notebook environment with PySpark support)
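Before opening a browser, you can check that each exposed port is actually accepting connections. The following is a minimal sketch using only the Python standard library; the port map mirrors the list above, and `probe` is a hypothetical helper, not part of this repository:

```python
import socket

# Host ports exposed by the cluster, as listed above.
SERVICE_PORTS = {
    "namenode": 9870,
    "datanode": 50020,
    "spark-master": 8080,
    "hiveserver2": 10000,
    "jupyter": 8888,
}

def probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, port in SERVICE_PORTS.items():
    status = "up" if probe("localhost", port) else "down"
    print(f"{name:>12} (:{port}): {status}")
```

Any service reported `down` has either not finished starting or has a port conflict (see Troubleshooting below for remapping ports).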
The Hive CLI allows users to manage and query Hive databases and tables directly from the command line. Below are the steps to interact with the Hive CLI:
1. Access the Hive server container: `docker exec -it hive-server bash`
2. Enter the Hive CLI: run `hive` inside the container.
3. List databases: `show databases;` lists all available databases.
4. Select a database: `use gold;` selects the `gold` database.
5. View tables: `show tables;` displays all tables within the selected database.
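The interactive steps above can also be scripted from the host by passing HiveQL to `hive -e` inside the container. A small Python sketch of building that invocation — the container name `hive-server` comes from the steps above, and actually running the command of course requires the cluster to be up:

```python
import subprocess

def hive_command(queries):
    """Build a `docker exec` invocation that runs HiveQL non-interactively."""
    return ["docker", "exec", "hive-server", "hive", "-e", " ".join(queries)]

cmd = hive_command(["show databases;", "use gold;", "show tables;"])
print(" ".join(cmd))

# To actually execute it against a running cluster:
# subprocess.run(cmd, check=True)
```

This is handy for smoke tests that verify the `gold` database and its tables exist after startup.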
- The `/data` directory contains three subdirectories: `bronze`, `silver`, and `gold`.
- The `/user/hive/warehouse/gold.db` directory represents the Hive warehouse for the `gold` database.
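The `bronze`/`silver`/`gold` layout follows the common medallion pattern (raw, cleaned, curated data). A tiny helper for deriving each layer's directory from that convention — the helper itself is hypothetical, not part of this repository:

```python
DATA_ROOT = "/data"
WAREHOUSE = "/user/hive/warehouse"
LAYERS = ("bronze", "silver", "gold")  # raw -> cleaned -> curated

def layer_path(layer: str) -> str:
    """Return the data directory for a medallion layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{DATA_ROOT}/{layer}"

print(layer_path("gold"))        # /data/gold
print(f"{WAREHOUSE}/gold.db")    # Hive warehouse for the gold database
```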
Here are some common issues and their solutions:
- Ports already in use: if any ports (e.g., `9870`, `8080`, `8888`) are already in use, modify the `docker-compose.yml` file to use different host ports.
- Configuration issues: ensure all environment variables and mounted volumes are correctly set in the `docker-compose.yml` file.
- Logs: check container logs for errors with:

  ```shell
  docker logs <container_name>
  ```
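For the port-conflict case, only the host side of the mapping needs to change; the container port stays the same. A hypothetical `docker-compose.yml` edit (service name assumed for illustration):

```yaml
services:
  namenode:
    ports:
      - "19870:9870"   # host port 19870 now serves the NameNode UI on container port 9870
```

After such a change, the NameNode UI would be reachable at http://localhost:19870/ instead of port 9870.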
📧 Email: engibrahimkhalid01@gmail.com
🔗 LinkedIn: Ibrahim Khalid
🐦 Twitter: @IbrahimKhalid_K
For any queries, feel free to reach out!




