
Hadoop-Docker-Spark-Hive-DWH-Project

This project utilizes Hadoop, Spark, SQL, and Hive to enable efficient data integration, transformation, warehousing, and analytics. It also features a dashboard for visualization, offering a comprehensive solution for managing and analyzing datasets.

Architecture Diagram

[Architecture diagram image]

A Hadoop cluster built using Docker Compose, including Hadoop, Postgres (for the Hive metastore), Jupyter Notebook, Hive, and Spark.

Table of Contents

  • Introduction
  • Quick Start
  • Interfaces
  • DWH Schema
  • Hive and Hadoop Overview
  • Reporting
  • Troubleshooting
  • Contact Information

Introduction

This repository uses Docker Compose to initialize a Hadoop cluster with the following components:

  • Hadoop: Distributed file system (HDFS) with NameNode and DataNode.
  • Postgres: Database backend for Hive metastore.
  • Jupyter Notebook: Interactive Python environment with PySpark integration.
  • Hive: Data warehouse solution for querying large datasets stored in HDFS.
  • Spark: Distributed computing framework for big data processing, including Spark Master and Worker nodes.

This setup is ideal for development, testing, and learning purposes. It allows you to experiment with Hadoop, Hive, Spark, and Jupyter in an isolated environment without requiring complex installations.
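
Once the containers are up, a quick smoke test from the host can confirm the core services are reachable. A minimal sketch; the container name namenode is an assumption, so check docker ps for the names your compose file actually assigns:

    # List the running cluster containers and their published ports
    docker ps --format "table {{.Names}}\t{{.Ports}}"

    # Ask the NameNode for an HDFS health report (namenode is an assumed name)
    docker exec -it namenode hdfs dfsadmin -report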

Quick Start

Prerequisites

  • Docker and Docker Compose installed on your machine

Starting the Cluster

  1. Clone this repository and navigate to the project directory:
    git clone https://github.com/IbrahimKhalid11/Hadoop-Docker-Spark-Hive-Data-Warehousing-Project.git
    cd Hadoop-Docker-Spark-Hive-Data-Warehousing-Project
  2. Run the startup script:
    ./start_demo.sh
  3. To stop the cluster, run:
    ./stop_demo.sh
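
The helper scripts presumably wrap standard Docker Compose commands; if you prefer to manage the lifecycle directly, the equivalents would be roughly:

    docker-compose up -d   # start all services in the background
    docker-compose ps      # verify that every container is running
    docker-compose down    # stop and remove the containers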

Interfaces

Once the cluster is up and running, you can access the following services via your browser. With the default port mappings these are typically:

  • Hadoop NameNode web UI: http://localhost:9870
  • Spark Master web UI: http://localhost:8080
  • Jupyter Notebook: http://localhost:8888

DWH Schema

The warehouse is modeled as a star schema.

[Star schema diagram]
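
To make the star-schema layout concrete, here is a hypothetical aggregation that joins a fact table to a dimension table; the table and column names (fact_sales, dim_customer, sales_amount, customer_key, country) are illustrative and not taken from this repository:

    # Run an ad-hoc HiveQL query against the gold database; hive -e executes
    # a query string without opening the interactive CLI
    docker exec -it hive-server hive -e "
      SELECT d.country, SUM(f.sales_amount) AS total_sales
      FROM gold.fact_sales f
      JOIN gold.dim_customer d ON f.customer_key = d.customer_key
      GROUP BY d.country;"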

Hive and Hadoop Overview

Hive CLI Interaction

The Hive CLI allows users to manage and query Hive databases and tables directly from the command line. Below are the steps to interact with the Hive CLI; a complete session is shown after the steps:

  1. Access the Hive Server Container
    Use the command docker exec -it hive-server bash to access the Hive server container.

  2. Enter the Hive CLI
    Execute the hive command to enter the Hive CLI.

  3. List Databases
    Run show databases; to list all available databases.

  4. Select a Database
    Use use gold; to select the gold database.

  5. View Tables
    Execute show tables; to display all tables within the selected database.
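
Put together, a typical session looks like this:

    docker exec -it hive-server bash   # 1. open a shell in the Hive server container
    hive                               # 2. start the Hive CLI
    hive> show databases;              # 3. list all databases
    hive> use gold;                    # 4. switch to the gold database
    hive> show tables;                 # 5. list its tables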

Hadoop Web UI: Directory Browser

Root Directory (/data)

[HDFS directory browser screenshot: /data]

  • The /data directory contains three subdirectories:
    • bronze
    • silver
    • gold

Hive Warehouse Directory (/user/hive/warehouse/gold.db)

[HDFS directory browser screenshot: /user/hive/warehouse/gold.db]

  • The /user/hive/warehouse/gold.db directory represents the Hive warehouse for the gold database.
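
Both locations can also be inspected from the command line instead of the web UI. A short sketch, assuming the NameNode container is named namenode (adjust to the name used in your docker-compose.yml):

    # List the bronze/silver/gold layer directories
    docker exec -it namenode hdfs dfs -ls /data

    # List the table directories of the Hive gold database
    docker exec -it namenode hdfs dfs -ls /user/hive/warehouse/gold.db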

Reporting

[Overview dashboard screenshot]

[Country-level dashboard screenshot]

Troubleshooting

Here are some common issues and their solutions:

  • Ports Already in Use:
    If any host ports (e.g., 9870, 8080, 8888) are already taken, modify the docker-compose.yml file to map different host ports (a sample remapping is shown after this list).

  • Configuration Issues:
    Ensure all environment variables and mounted volumes are correctly set in the docker-compose.yml file.

  • Logs:
    Check container logs for errors using the following command:

    docker logs <container_name>
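
For the port-conflict case above, only the host side of a port mapping needs to change. A sketch of the relevant docker-compose.yml fragment (the service name and new host port are illustrative):

    services:
      namenode:
        ports:
          - "19870:9870"   # host port 19870 -> container port 9870 (NameNode web UI)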
    

Contact Information

📧 Email: engibrahimkhalid01@gmail.com
🔗 LinkedIn: Ibrahim Khalid
🐦 Twitter: @IbrahimKhalid_K

For any queries, feel free to reach out!
