ChaitanyaSaik/Distributed-Data-Processing-using-PySpark

Distributed Data Processing using PySpark

Project Overview

This project demonstrates a scalable data processing pipeline using Apache Spark (PySpark) to handle large datasets efficiently. The focus is on comparing distributed computing with traditional sequential processing and showcasing performance improvements.

The system takes student performance data, scales it up to simulate big data conditions, and applies ETL (Extract, Transform, Load) operations in a distributed environment.

Key Features

  • Built a distributed ETL pipeline using PySpark
  • Scaled dataset from thousands to millions of records to simulate real-world workloads
  • Applied data transformations including filtering, aggregations, and feature engineering
  • Executed SQL-based queries using Spark SQL
  • Benchmarked performance against Pandas (sequential processing)
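The features above can be sketched as one small pipeline. This is a minimal illustration, not the project's actual code: the column names (`student_id`, `subject`, `score`), the pass threshold of 50, the view name `scores`, and the helper `run_etl` are all assumptions, since the README does not describe the dataset schema.

```python
# Hypothetical sample rows: (student_id, subject, score). The real dataset's
# schema is not given in this README, so these columns are assumptions.
SAMPLE_ROWS = [
    (1, "math", 82.0),
    (2, "math", 47.0),
    (3, "physics", 91.0),
    (4, "physics", 58.0),
]

# Spark SQL equivalent of the DataFrame aggregation in run_etl().
SUMMARY_SQL = """
    SELECT subject,
           COUNT(*)                    AS n_records,
           AVG(score)                  AS avg_score,
           AVG(CAST(passed AS DOUBLE)) AS pass_rate
    FROM scores
    GROUP BY subject
    ORDER BY subject
"""


def run_etl(spark, rows=SAMPLE_ROWS, scale_factor=1000):
    """Extract, scale, transform, and aggregate; returns (df_summary, sql_summary)."""
    # Imported here so this module still loads where PySpark is not installed.
    from pyspark.sql import functions as F

    # Extract: build a DataFrame from in-memory rows (a real run would read files).
    base = spark.createDataFrame(rows, ["student_id", "subject", "score"])

    # Scale: replicate every row scale_factor times to simulate a larger workload.
    copies = spark.range(scale_factor).withColumnRenamed("id", "copy_id")
    scaled = base.crossJoin(copies)

    # Transform: drop missing scores and derive a pass/fail feature
    # (the threshold of 50 is an assumed value).
    cleaned = (
        scaled.filter(F.col("score").isNotNull())
        .withColumn("passed", F.col("score") >= 50.0)
    )

    # Aggregate with the DataFrame API.
    df_summary = (
        cleaned.groupBy("subject")
        .agg(
            F.count("*").alias("n_records"),
            F.avg("score").alias("avg_score"),
            F.avg(F.col("passed").cast("double")).alias("pass_rate"),
        )
        .orderBy("subject")
    )

    # Same aggregation through Spark SQL on a temporary view.
    cleaned.createOrReplaceTempView("scores")
    sql_summary = spark.sql(SUMMARY_SQL)
    return df_summary, sql_summary
```

With a local session created by `SparkSession.builder.master("local[*]").getOrCreate()`, calling `run_etl(spark)` returns two DataFrames with one row per subject. Nothing is computed until an action such as `.show()` or `.collect()` is called on them.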

Performance Insights

  • Processed more than 1 million records using Spark
  • Achieved a roughly 4x speedup over Pandas
  • Demonstrated benefits of in-memory computation and parallel execution
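A comparison like this depends on how the timings are taken. The following is a small timing sketch using only the standard library. The helper names `best_of` and `speedup` are hypothetical and do not reflect how the project measured its results.

```python
import time


def best_of(fn, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` calls of fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds
```

Spark transformations are lazy, so the callable timed for Spark should end in an action such as `.count()` or a write. Otherwise only query planning is measured and the comparison with Pandas, which executes eagerly, is not like for like.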

Technical Implementation

  • Data Processing: Filtering, joins, aggregations, window functions
  • Pipeline Design: Batch processing with partitioned data writes
  • Optimization: Tuned partition sizes and execution strategies for better performance
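As an illustration of the window-function and partitioned-write items above, here is a hedged sketch. The column names `subject` and `score`, the output format (Parquet), and both helper functions are assumptions. The 128 MB target mirrors Spark's default for `spark.sql.files.maxPartitionBytes` and is only a starting point for tuning.

```python
import math

# 128 MB matches Spark's default for spark.sql.files.maxPartitionBytes;
# treat it as a starting point, not a tuned value.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def suggest_partitions(total_bytes, target_bytes=TARGET_PARTITION_BYTES):
    """Rule-of-thumb partition count: one per target_bytes, and at least 1."""
    return max(1, math.ceil(total_bytes / target_bytes))


def rank_and_write(df, out_dir, num_partitions):
    """Rank rows within each subject, then write Parquet partitioned by subject."""
    # Imported here so this module still loads where PySpark is not installed.
    from pyspark.sql import Window, functions as F

    # Window function: dense rank by descending score inside each subject.
    by_subject = Window.partitionBy("subject").orderBy(F.col("score").desc())
    ranked = df.withColumn("rank_in_subject", F.dense_rank().over(by_subject))

    # Repartition on the same key before writing so each subject's rows are
    # co-located, then write one output directory per subject value.
    (
        ranked.repartition(num_partitions, "subject")
        .write.mode("overwrite")
        .partitionBy("subject")
        .parquet(out_dir)
    )
    return ranked
```

For example, `suggest_partitions(1024 ** 3)` gives 8 for roughly 1 GiB of input. The right number in practice also depends on core count and data skew.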

Tech Stack

  • Apache Spark (PySpark)
  • Python
  • SQL
  • Pandas (for benchmarking)

How to Run

  1. Install dependencies:

    • Python 3.x
    • PySpark
    • Java (JDK 8 or above)
  2. Run the notebook or script:

    • Load dataset
    • Execute ETL pipeline
    • View results and performance metrics
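Before running the notebook, it can help to confirm that the prerequisites from step 1 are present. The check below is a minimal sketch using only the standard library, and `check_environment` is a hypothetical helper name. It does not verify the JDK version, only that a `java` executable is on the PATH.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Report whether each prerequisite from step 1 appears to be available."""
    return {
        "python_3": sys.version_info.major == 3,
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
        "java_on_path": shutil.which("java") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If `pyspark_importable` is reported as missing, PySpark is typically installed with `pip install pyspark`.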

Outcome

This project highlights how distributed systems like Spark can outperform single-machine tools such as Pandas on large datasets, and it provides hands-on experience in building scalable data pipelines for real-world data engineering tasks.
