A machine learning pipeline for analyzing Spotify playlist data and generating music recommendations using collaborative filtering and embedding-based approaches.
This project processes the Million Playlist Dataset (MPD) to build music recommendation systems using various machine learning techniques including:
- Collaborative filtering with co-occurrence matrices
- Item2Vec embeddings for track representations
- Hyperparameter tuning and experimentation
- Evaluation metrics for recommendation quality
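
To make the co-occurrence idea concrete, here is a minimal sketch using only the standard library. It is not the code in `pipeline/build_co_occurrence.py`; the function names `build_co_occurrence` and `recommend`, the toy track IDs, and the scoring rule (summed pair counts) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_co_occurrence(playlists):
    """Count how often each pair of tracks appears in the same playlist."""
    co_counts = defaultdict(Counter)
    for tracks in playlists:
        # Deduplicate so a track repeated within one playlist counts once.
        for a, b in combinations(sorted(set(tracks)), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1
    return co_counts


def recommend(co_counts, seed_tracks, k=3):
    """Rank candidates by their summed co-occurrence with the seed tracks."""
    seeds = set(seed_tracks)
    scores = Counter()
    for seed in seeds:
        for candidate, count in co_counts.get(seed, {}).items():
            if candidate not in seeds:
                scores[candidate] += count
    # Sort by score (descending), then by track ID so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [track for track, _ in ranked[:k]]


# Toy playlists with hypothetical track IDs (not real MPD data).
playlists = [
    ["track_a", "track_b", "track_c"],
    ["track_a", "track_b", "track_d"],
    ["track_b", "track_c"],
]
co_counts = build_co_occurrence(playlists)
top_tracks = recommend(co_counts, ["track_a"], k=2)
print(top_tracks)
```

Raw counts favor globally popular tracks, so a real system would likely normalize the scores (for example by track frequency); that step is left out here to keep the sketch short.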
spotifyprofiler/
├── data/MPD/ # Million Playlist Dataset
├── pipeline/ # Core processing scripts
│ ├── mpd_processor.py # MPD data processing
│ ├── build_co_occurrence.py # Co-occurrence matrix builder
│ ├── build_track_vocab.py # Track vocabulary builder
│ ├── item2vec_trainer.py # Item2Vec model trainer
│ └── reccobeats_client.py # Recommendation client
├── tuning/ # Hyperparameter tuning results
│ ├── checkpoints/ # Model checkpoints
│ ├── embeddings/ # Trained embeddings
│ └── experiment_results/ # Experiment results
├── requirements.txt # Python dependencies
└── run_*.py # Execution scripts
- Clone the repository

    git clone <your-repo-url>
    cd spotifyprofiler

- Set up virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install dependencies

    pip install -r requirements.txt

- Set up environment variables

    cp env_template.txt .env
    # Edit .env with your configuration
- Process MPD data

    python pipeline/mpd_processor.py

- Build co-occurrence matrix

    python pipeline/build_co_occurrence.py

- Build track vocabulary

    python pipeline/build_track_vocab.py

- Train Item2Vec embeddings

    python pipeline/item2vec_trainer.py

- Run hyperparameter tuning

    python run_second_round.py
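
The vocabulary and Item2Vec steps above can be pictured with a small standard-library sketch. It shows one plausible way to map track IDs to integer indices and to generate the (center, context) pairs that a skip-gram style model such as Item2Vec trains on. The names `build_vocab` and `training_pairs`, the `min_count` and `window` parameters, and the toy track IDs are assumptions; the actual scripts in `pipeline/` may work differently.

```python
from collections import Counter


def build_vocab(playlists, min_count=1):
    """Map each track seen at least `min_count` times to an integer index."""
    counts = Counter(track for tracks in playlists for track in tracks)
    kept = sorted(track for track, count in counts.items() if count >= min_count)
    return {track: index for index, track in enumerate(kept)}


def training_pairs(playlists, vocab, window=2):
    """Emit (center, context) index pairs within `window` positions."""
    pairs = []
    for tracks in playlists:
        # Drop out-of-vocabulary tracks before windowing.
        ids = [vocab[track] for track in tracks if track in vocab]
        for i, center in enumerate(ids):
            start = max(0, i - window)
            stop = min(len(ids), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    pairs.append((center, ids[j]))
    return pairs


# Toy playlists with hypothetical track IDs (not real MPD data).
playlists = [
    ["t1", "t2", "t3"],
    ["t2", "t3", "t4"],
]
vocab = build_vocab(playlists, min_count=2)
pairs = training_pairs(playlists, vocab, window=1)
print(vocab)
print(pairs)
```

Setting `window` to at least the playlist length treats each playlist as an unordered set, so every track in it becomes context for every other track.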
    from pipeline.reccobeats_client import RecCoBeatsClient

    client = RecCoBeatsClient()
    recommendations = client.get_recommendations(playlist_tracks)

The project uses JSON configuration files for different experiments:
- best_second_round_config_working.json - Best performing configuration
- second_round_checkpoint_working.json - Training checkpoint
- Various experiment configs in tuning/experiment_results/
The project includes extensive hyperparameter tuning results stored in tuning/experiment_results/ with metrics including:
- Precision@K
- Recall@K
- NDCG@K
- MRR (Mean Reciprocal Rank)
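
For reference, the four metrics listed above can be sketched with their common textbook definitions, assuming binary relevance and a recommendation list without duplicates. The function names and toy data are illustrative assumptions and may not match how this project computes its reported numbers.

```python
import math


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for track in recommended[:k] if track in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant tracks that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for track in recommended[:k] if track in relevant)
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """DCG of the top k divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, track in enumerate(recommended[:k])
        if track in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant track, or 0.0 if none is found."""
    for rank, track in enumerate(recommended, start=1):
        if track in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results):
    """Average reciprocal rank over (recommended, relevant) pairs."""
    if not results:
        return 0.0
    return sum(reciprocal_rank(rec, rel) for rec, rel in results) / len(results)


# Toy example with hypothetical track IDs.
recommended = ["a", "b", "c", "d"]
relevant = {"b", "d", "e"}
print(precision_at_k(recommended, relevant, k=3))
print(recall_at_k(recommended, relevant, k=3))
print(ndcg_at_k(recommended, relevant, k=3))
print(mean_reciprocal_rank([(recommended, relevant)]))
```

In an evaluation loop, `relevant` would typically be the held-out tracks of a playlist and `recommended` the model's ranked output for the remaining tracks.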
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
[Add your license here]
- Million Playlist Dataset (MPD) for providing the training data
- Item2Vec paper for the embedding approach
- Various open-source libraries used in this project