Distributed Data Processing using PySpark

Project Overview

This project demonstrates a scalable data processing pipeline using Apache Spark (PySpark) to handle large datasets efficiently. The focus is on comparing distributed computing with traditional sequential processing and showcasing performance improvements.

The system processes large-scale student performance data by simulating big data conditions and applying ETL (Extract, Transform, Load) operations in a distributed environment.

Key Features

Built a distributed ETL pipeline using PySpark
Scaled the dataset from thousands to millions of records to simulate real-world workloads
Applied data transformations including filtering, aggregations, and feature engineering
Executed SQL-based queries using Spark SQL
Benchmarked performance against Pandas (sequential processing)

Performance Insights

Processed over 1 million records with Spark
Ran roughly 4x faster than Pandas in the project's benchmark
Demonstrated benefits of in-memory computation and parallel execution
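A comparison like this depends on how the timing is done. The sketch below is one hedged way to measure it, using only the standard library; `best_of`, `speedup`, and the two stand-in workloads are hypothetical names for illustration and do not come from the project.

```python
import time


def best_of(fn, repeats=3):
    """Run `fn` several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Stand-in workloads; in practice these would wrap the Pandas and Spark pipelines.
def baseline_job():
    return sum(i * i for i in range(200_000))


def candidate_job():
    return sum(i * i for i in range(50_000))


if __name__ == "__main__":
    baseline = best_of(baseline_job)
    candidate = best_of(candidate_job)
    print(f"speedup: {speedup(baseline, candidate):.1f}x")
```

When timing Spark, the timed callable needs to end in an action such as `count()` or a write, because transformations are lazy and would otherwise return almost immediately. Session startup time is also worth measuring separately, since it can dominate on small inputs.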

Technical Implementation

Data Processing: Filtering, joins, aggregations, window functions
Pipeline Design: Batch processing with partitioned data writes
Optimization: Tuned partition sizes and execution strategies for better performance

Tech Stack

Apache Spark (PySpark)
Python
SQL
Pandas (for benchmarking)

How to Run

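Install dependencies:
- Python 3.x
- PySpark
- Java (JDK 8 or above; check which versions your Spark release supports)

Run the notebook (Apache_Spark.ipynb) or script:
- Load the dataset (StudentsPerformance.csv)
- Execute the ETL pipeline
- View results and performance metrics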
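Before running the pipeline, it can help to confirm that the dependencies are visible to Python. The following is a small, hypothetical pre-flight check using only the standard library; `check_environment` is an illustrative helper, not part of the project.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Report whether each prerequisite appears to be available."""
    return {
        "python3": sys.version_info.major == 3,
        # Looks for a `java` executable on PATH only.
        "java_on_path": shutil.which("java") is not None,
        # True if `import pyspark` would find the package.
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

A missing `java_on_path` entry is not always fatal, since Spark can also locate Java through the `JAVA_HOME` environment variable.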

Outcome

This project highlights how a distributed system like Spark can outperform sequential, single-machine processing on large datasets, and it provides hands-on experience in building scalable data pipelines for real-world data engineering tasks.
