The Wikipedia Scraper is a tool that extracts, cleans, and processes music metadata from Wikipedia pages. It is part of the music_db repository and lives in the code/wikipedia directory. Follow these instructions to set up and run the scraper.
Before you start, ensure you have Python and Git installed on your system. You'll need Git to clone the repository and access the Wikipedia Scraper. If you're unsure whether you have Git or need to install it, refer to the Git documentation.

## Getting Started
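You can quickly confirm both prerequisites from a terminal (a minimal check, assuming the Python interpreter is available as `python3` on your system):

```shell
# Each command prints a version string and fails if the tool is missing
git --version
python3 --version
```

If either command reports "command not found", install the missing tool before continuing.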
Start by cloning the music_db repository to your local machine. Open a terminal or command prompt and run the following command:
```shell
git clone https://github.com/edin-dal/music_db
```
Change into the directory containing the Wikipedia Scraper script:
```shell
cd music_db/code/wikipedia
```
Before running the script, you need to ensure it has the necessary execution permissions. Grant execution permissions by running:
```shell
chmod +x ./scrape_clean.sh
```
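To double-check that the permission change took effect, you can test the executable bit directly (a quick sanity check, not part of the original steps):

```shell
# Prints "ok" only if the shell is allowed to execute the script
[ -x ./scrape_clean.sh ] && echo "ok"
```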
Now, you're ready to run the scraper. Execute the script by running:
```shell
./scrape_clean.sh
```
The script will begin processing. This may take some time.
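Because the run can be lengthy, it may help to keep a log of the script's output. One common approach (the log file name here is just an example) is to pipe the output through `tee`:

```shell
# Stream the scraper's output to the terminal while also saving it to a file
./scrape_clean.sh 2>&1 | tee scrape.log
```

You can then review `scrape.log` afterwards if anything looks off.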
Upon successful completion, the script will create an output folder within the music_db/code/wikipedia directory. This folder contains the final cleaned and processed data extracted from Wikipedia.

## Troubleshooting
If you encounter permission errors while running ./scrape_clean.sh, ensure that you've set the execution permissions as described above. Also make sure you are in the correct directory (music_db/code/wikipedia) before running the script; running it from a different directory may cause path-related errors.
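Both checks above can be run directly from the shell (a quick diagnostic sketch; the expected path suffix assumes the directory layout described earlier):

```shell
# Where am I? The path should end in code/wikipedia.
pwd
# Does the script have the execute bit? The mode string should contain an 'x'.
ls -l scrape_clean.sh
```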
For questions, issues, or support regarding the Wikipedia Scraper, please open an issue in the GitHub repository, and I'll get back to you as soon as possible.