This project takes a song with multiple instruments as input and splits it into its constituent instrument tracks using machine learning. The output can be any of the song's constituent tracks, or any combination of them mixed together.
The output layers need to be tuned according to the data the model is trained on. For example, if we only want the drums and accompaniment (a mixture of every other track) as output, we first need to generate that data, and then adjust the output layers of the model to produce two tracks of the same size as the input.
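As a rough illustration, the final layer of a Keras-style U-Net can be pointed at however many target tracks are needed. This is only a sketch under assumed layer choices and parameter names, not the exact architecture used here:

```python
import tensorflow as tf

def output_head(decoder_features: tf.Tensor, num_target_tracks: int = 2) -> tf.Tensor:
    """Map the last decoder feature map to one spectrogram channel per target track.

    For a drums + accompaniment setup, num_target_tracks = 2, so the model emits
    two spectrograms with the same height/width as the input. (Illustrative only.)
    """
    return tf.keras.layers.Conv2D(
        filters=num_target_tracks,   # one output channel per separated track
        kernel_size=1,
        padding="same",
        activation="relu",           # spectrogram magnitudes are non-negative
    )(decoder_features)
```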
The model used is a CNN U-Net based on this research paper - https://arxiv.org/pdf/1810.11520.pdf .
Some architectural choices are inspired by this repository - https://github.com/mohammadreza490/music-source-separation-using-Unets. Shout out to mohammadreza490, since it was a big help in the making of this project!
The steps taken are as follows (code sketches illustrating several of these steps are given after the list):
- Generate the audio tracks required by the use case. For this example, let's suppose we want the drums and accompaniment tracks. So we generate the drums and accompaniment tracks for every song in the dataset, using the STEMS of each song.
- Load the tracks using the Librosa library.
- Preprocess the data. This involves splitting each track into fixed-length segments and padding the last segment to the full segment length.
- Save the audio data as a dictionary (in .npy format), so that it can be loaded later without having to preprocess it again.
- Convert the data from audio format (time domain) to image format (time-frequency domain) that our CNN can process. We use the STFT (Short-Time Fourier Transform) to convert the audio into a spectrogram, which is a representation of the track in the time-frequency domain.
- Save these spectrograms as a dictionary in HDF5 format on disk while generating them, so that they don't take up RAM.
- Create a TensorFlow Dataset that loads batches of track spectrograms from the HDF5 file, so that only the required data is loaded into RAM.
- Train the U-Net on the image data, with the original song segment's spectrogram as input and the separated tracks' spectrograms (for example, the drums and accompaniment tracks) stacked on top of each other as the expected prediction.
- Apply the same preprocessing to the track we want to separate, except we don't need to save it in an HDF5 file, because a single song is far cheaper in terms of memory than the whole dataset.
- Make a prediction with the U-Net and postprocess the output tracks, which includes joining the segments, adding back the phase information lost during the initial STFT, and converting the spectrograms back to audio with the ISTFT (Inverse Short-Time Fourier Transform).
- Save the output tracks as .wav files.
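The sketches below illustrate several of these steps under assumed parameters (sample rate, segment length, STFT settings) and file/variable names; the project's actual implementation inside PipelineHandler may differ. First, loading a track with Librosa, splitting it into fixed-length padded segments, and saving the result as a .npy dictionary:

```python
import numpy as np
import librosa

SAMPLE_RATE = 22050      # assumed sample rate
SEGMENT_SECONDS = 10     # assumed segment length

def load_and_segment(path: str) -> np.ndarray:
    """Load a track, pad its tail, and split it into fixed-length segments."""
    audio, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    seg_len = SAMPLE_RATE * SEGMENT_SECONDS
    # Pad the end so the track divides evenly into segments.
    audio = np.pad(audio, (0, (-len(audio)) % seg_len))
    return audio.reshape(-1, seg_len)          # shape: (num_segments, seg_len)

# Store every track in one dictionary so preprocessing never has to be repeated.
tracks = {"song1_mix": load_and_segment("song1/mix.wav")}
np.save("segmented_audio.npy", tracks, allow_pickle=True)
# Reload later with: np.load("segmented_audio.npy", allow_pickle=True).item()
```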
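Next, a sketch of converting each segment to a magnitude spectrogram with the STFT and streaming the results into an HDF5 file so they never all sit in RAM at once (STFT settings and dataset names are assumptions):

```python
import numpy as np
import librosa
import h5py

N_FFT, HOP_LENGTH = 1024, 256   # assumed STFT parameters

with h5py.File("spectrograms.h5", "w") as f:
    for name, segments in tracks.items():        # `tracks` from the previous sketch
        mags = [np.abs(librosa.stft(seg, n_fft=N_FFT, hop_length=HOP_LENGTH))
                for seg in segments]              # keep magnitudes; phase is handled later
        f.create_dataset(name, data=np.stack(mags))   # one dataset per track
```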
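A TensorFlow Dataset can then pull batches out of that HDF5 file on demand. The dataset keys ('mix', 'targets') and the spectrogram shapes below are assumptions:

```python
import h5py
import numpy as np
import tensorflow as tf

def spectrogram_pairs(h5_path="spectrograms.h5"):
    """Yield (mixture, stacked targets) spectrogram pairs one at a time from disk."""
    with h5py.File(h5_path, "r") as f:
        mix, targets = f["mix"], f["targets"]     # assumed dataset names
        for i in range(mix.shape[0]):
            yield mix[i][..., np.newaxis], targets[i]

dataset = (
    tf.data.Dataset.from_generator(
        spectrogram_pairs,
        output_signature=(
            tf.TensorSpec(shape=(513, None, 1), dtype=tf.float32),  # mixture spectrogram
            tf.TensorSpec(shape=(513, None, 2), dtype=tf.float32),  # drums + accompaniment
        ),
    )
    .batch(16)
    .prefetch(tf.data.AUTOTUNE)
)
```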
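Training pairs each mixture spectrogram with the target spectrograms stacked along the last axis. The tiny stand-in model and the loss below are placeholders, not the paper's U-Net:

```python
import tensorflow as tf

def build_toy_model(num_targets=2) -> tf.keras.Model:
    """Stand-in for the real encoder-decoder U-Net with skip connections."""
    inputs = tf.keras.Input(shape=(513, None, 1))
    x = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    outputs = tf.keras.layers.Conv2D(num_targets, 1, padding="same", activation="relu")(x)
    return tf.keras.Model(inputs, outputs)

model = build_toy_model()
model.compile(optimizer="adam", loss="mean_absolute_error")   # assumed loss
model.fit(dataset, epochs=50)                                 # `dataset` from the previous sketch
model.save_weights("saved_models/modelCheckpoint.h5")         # folder must already exist
```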
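Finally, a sketch of the postprocessing: the predicted magnitudes are recombined with the mixture's phase, inverted with the ISTFT, joined back into one waveform, and written to a .wav file. Names and parameters are again illustrative:

```python
import numpy as np
import librosa
import soundfile as sf

HOP_LENGTH = 256   # must match the hop length used in the forward STFT

def spectrograms_to_audio(predicted_mags, mixture_phases):
    """Rebuild a waveform from per-segment predicted magnitudes plus mixture phases."""
    segments = []
    for mag, phase in zip(predicted_mags, mixture_phases):
        complex_stft = mag * np.exp(1j * phase)    # re-attach the phase lost during the STFT
        segments.append(librosa.istft(complex_stft, hop_length=HOP_LENGTH))
    return np.concatenate(segments)                # join the segments back together

# e.g. sf.write("results/drums.wav", drums_audio, samplerate=22050)
```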
All of this is encapsulated in the PipelineHandler class, and we only need to call its high-level functions. Alternatively, we can just run the main.py file with the required parameters, and everything will be taken care of.
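A purely hypothetical call sequence is sketched below; the actual method names, constructor arguments, and module path are not documented in this README, so refer to the notebook in the 'notebooks' folder for the real API. Only the class name PipelineHandler and main.py come from the project itself:

```python
# Hypothetical usage only; the real PipelineHandler methods may be named differently.
from pipeline_handler import PipelineHandler   # assumed module name

handler = PipelineHandler(root_folder="/content/drive/MyDrive/music-source-separation")  # assumed argument
handler.preprocess_dataset()   # hypothetical: generate, segment, and store the spectrograms
handler.train()                # hypothetical: train the U-Net
handler.predict("data/song_to_seperate/seperateMyTracks.wav")   # hypothetical: separate a song
```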
A part of the SLAKH dataset (http://www.slakh.com) was used to train this model, and its utility repository (https://github.com/ethman/slakh-utils/tree/master?tab=readme-ov-file#readme) was used to generate the audio tracks required by the use case.
- Copy the Google Colab notebook, which can be found in the 'notebooks' folder.
- Mount your Google Drive and clone the repository in the Google Colab environment (this is included in the notebook).
- You'll need the same folder structure in your Drive as in this project (or in the Drive link below). Then just pass the path to this root folder to the PipelineHandler class, as shown in the notebook.
- Put the pretrained model weights in the 'saved_models' folder with the name 'modelCheckpoint.h5', and the song you want to separate in 'data/song_to_seperate' with the name 'seperateMyTracks.wav'.
- Then run the prediction cell in the notebook.
- The results will be saved in the 'results' folder.
Examples of separated tracks, trained model weights, and other files used in this project can be found here: https://drive.google.com/drive/folders/1a2OcKI8fIIyNQirv2pj0-uCaBWbEourR?usp=sharing