This project demonstrates a scalable data processing pipeline using Apache Spark (PySpark) to handle large datasets efficiently. The focus is on comparing distributed computing with traditional sequential processing and showcasing performance improvements.
The system processes large-scale student performance data by simulating big data conditions and applying ETL (Extract, Transform, Load) operations in a distributed environment.
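
One way to simulate big data conditions from a small sample is to resample the original records and add a little noise. The sketch below is an illustration under assumptions, not the project's actual generator: the function name `scale_up`, the fields `student_id`, `school`, and `score`, and the 0 to 100 score range are all hypothetical.

```python
import random


def scale_up(rows, target_size, jitter=5.0, seed=42):
    """Grow a small list of records to target_size rows by resampling with noise.

    Field names (student_id, school, score) are assumed for illustration.
    """
    rng = random.Random(seed)  # fixed seed keeps the synthetic data reproducible
    scaled = []
    for i in range(target_size):
        base = rows[i % len(rows)]  # cycle through the original records
        noisy = base["score"] + rng.uniform(-jitter, jitter)
        noisy = min(100.0, max(0.0, noisy))  # clamp to the assumed 0-100 range
        scaled.append({**base, "student_id": i, "score": round(noisy, 1)})
    return scaled


sample = [
    {"student_id": 0, "school": "A", "score": 72.0},
    {"student_id": 1, "school": "B", "score": 48.5},
]
big = scale_up(sample, 1000)
```

For millions of rows the same idea is usually applied inside Spark itself rather than in a Python loop, so that generation is also distributed.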
- Built a distributed ETL pipeline using PySpark
- Scaled dataset from thousands to millions of records to simulate real-world workloads
- Applied data transformations including filtering, aggregations, and feature engineering
- Executed SQL-based queries using Spark SQL
- Benchmarked performance against Pandas (sequential processing)
- Processed over 1M records efficiently using Spark
- Achieved ~4x performance improvement compared to Pandas
- Demonstrated benefits of in-memory computation and parallel execution
- Data Processing: Filtering, joins, aggregations, window functions
- Pipeline Design: Batch processing with partitioned data writes
- Optimization: Tuned partition sizes and execution strategies for better performance
- Apache Spark (PySpark)
- Python
- SQL
- Pandas (for benchmarking)
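
The pipeline stages listed above (filtering, feature engineering, window functions, Spark SQL, partitioned writes) could be combined roughly as follows. This is a minimal sketch, not the project's actual code: the function name `run_etl`, the columns `school` and `score`, the view name `students`, and the pass mark of 50 are assumptions made for illustration.

```python
# Hypothetical column names (school, score) and view name (students);
# adjust them to match the real dataset.
AVG_SCORE_SQL = """
    SELECT school, AVG(score) AS avg_score, COUNT(*) AS n_students
    FROM students
    GROUP BY school
"""


def run_etl(spark, input_path, output_path, pass_mark=50.0):
    """Extract a CSV, transform it, and load it as school-partitioned Parquet."""
    # Imported here so this module can be loaded even where PySpark is absent.
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Extract: read the raw CSV and let Spark infer column types.
    raw = spark.read.csv(input_path, header=True, inferSchema=True)

    # Transform: drop rows without a score, then derive a pass/fail feature.
    clean = raw.filter(F.col("score").isNotNull())
    featured = clean.withColumn("passed", F.col("score") >= F.lit(pass_mark))

    # Window function: rank each student within their school by score.
    by_school = Window.partitionBy("school").orderBy(F.col("score").desc())
    ranked = featured.withColumn("rank_in_school", F.rank().over(by_school))

    # Spark SQL: expose the DataFrame as a view and aggregate per school.
    ranked.createOrReplaceTempView("students")
    summary = spark.sql(AVG_SCORE_SQL)

    # Load: write one output directory per school.
    ranked.write.mode("overwrite").partitionBy("school").parquet(output_path)
    return summary
```

With a session created via `SparkSession.builder.appName("student-etl").getOrCreate()`, a call such as `run_etl(spark, "students.csv", "out/students").show()` would print the per-school summary; both paths are placeholders.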
Install dependencies:
- Python 3.x
- PySpark
- Java (JDK 8 or above)
Run the notebook or script:
- Load dataset
- Execute ETL pipeline
- View results and performance metrics
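
The performance metrics in the last step can be collected with a small timing helper like the one below. It is a hedged sketch using only the standard library; `time_call` and `speedup` are hypothetical names, and the workload shown is a stand-in rather than the real Pandas or Spark job.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Stand-in workload; replace with the Pandas and Spark versions of the same job.
elapsed = time_call(lambda: sum(range(100_000)))
print(f"best of 3: {elapsed:.4f} s")
```

Because Spark evaluates lazily, the function being timed needs to end in an action such as `count()` or a write. Otherwise only the construction of the query plan is measured, and the comparison with Pandas is not meaningful.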
This project highlights how distributed systems like Spark can outperform single-machine tools such as Pandas on large datasets, and provides hands-on experience in building scalable data pipelines for real-world data engineering tasks.