Author: Diego Gutierrez
NOTE: To skip the project overview and head straight to using the model, click here.
Heart disease is one of the leading causes of death worldwide, with Coronary Artery Disease (CAD) being a major contributor. CAD is a chronic condition where the coronary arteries become narrowed or blocked, reducing blood flow to the heart muscle. If left undiagnosed and untreated, CAD can lead to severe complications, including a heart attack. Early detection and intervention are critical for preventing such outcomes.
This project aims to leverage machine learning to predict the probability of CAD and assess whether a patient is at risk of heart disease based on clinical data.
The solution involves:
- Training and evaluating four different machine learning models: Logistic Regression, Decision Tree, Random Forest, and XGBoost.
- Selecting the best-performing model based on the ROC AUC Score, which evaluates the model's ability to distinguish between patients with and without CAD.
- Deploying the chosen model using Docker to AWS Elastic Beanstalk, enabling scalable access through an API.
The deployed API will allow users to input patient data and receive real-time predictions of CAD probability. This tool can empower healthcare providers with actionable insights for early diagnosis and improved patient outcomes.
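As a sketch of how a client might call the deployed service: the `/predict` route, the response shape, and the example feature values below are assumptions for illustration (check src/predict.py for the actual contract); the feature names follow the dataset columns described later in this README, and port 9696 matches the Docker run command in the setup steps.

```python
import requests

def predict_cad(patient, url="http://localhost:9696/predict"):
    """Send patient features to the running service and return its JSON reply.

    The "/predict" path is an assumption; see src/predict.py for the real route.
    """
    response = requests.post(url, json=patient)
    response.raise_for_status()
    return response.json()

# Illustrative patient record using the heart.csv feature names.
patient = {
    "age": 57, "sex": 1, "cp": 0, "trtbps": 140, "chol": 241,
    "fbs": 0, "restecg": 1, "thalachh": 123, "exng": 1,
    "oldpeak": 0.2, "slp": 1, "caa": 0, "thall": 3,
}
# predict_cad(patient)  # requires the service (e.g. the Docker container) to be running
```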
- Containerization: Docker
- Model Deployment: AWS Elastic Beanstalk
- Machine Learning Models: Logistic Regression, Decision Tree, Random Forest, XGBoost
- Dependency Management: Pipenv
- Notebooks: Jupyter Notebook
The dataset used in this project supports heart disease prediction and analysis, specifically the assessment of coronary artery disease (CAD) through health metrics such as age, cholesterol levels, and chest pain type, providing a foundation for building predictive models. The data was obtained from Kaggle and is available at Heart Attack Analysis & Prediction Dataset.
The CSV file, heart.csv, contains patient health metrics and is provided in the data directory. Below are the columns included in the dataset:
- age: Age of the patient in years.
- sex: Sex of the patient (0 = female; 1 = male).
- cp: Chest pain type (0 = asymptomatic; 1 = typical angina; 2 = atypical angina; 3 = non-anginal pain).
- trtbps: Resting blood pressure (measured in mm Hg at hospital admission).
- chol: Serum cholesterol in mg/dl.
- fbs: Fasting blood sugar (>120 mg/dl; 1 = true, 0 = false).
- restecg: Resting electrocardiographic results (0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy).
- thalachh: Maximum heart rate achieved.
- exng: Exercise-induced angina (0 = no; 1 = yes).
- oldpeak: ST depression induced by exercise relative to rest.
- slp: Slope of the peak exercise ST segment (0 = downsloping; 1 = upsloping; 2 = flat).
- caa: Number of major vessels (0-3) colored by fluoroscopy.
- thall: Thalassemia type (1 = fixed defect; 2 = normal; 3 = reversible defect).
- output: Diagnosis of heart disease (angiographic disease status; 0 = <50% diameter narrowing, 1 = >50% diameter narrowing).
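A quick look at the columns above with pandas. In the project you would load `data/heart.csv` directly; the two rows below are made-up illustrative values standing in for the real file so the snippet is self-contained.

```python
import io
import pandas as pd

# In the project: df = pd.read_csv("data/heart.csv")
# Here, two illustrative rows (made-up values) stand in for the real file.
sample = io.StringIO(
    "age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output\n"
    "63,1,3,145,233,1,0,150,0,2.3,0,0,1,1\n"
    "57,0,0,120,354,0,1,163,1,0.6,2,0,2,0\n"
)
df = pd.read_csv(sample)

print(df.shape)                  # 13 feature columns plus the "output" target
print(df["output"].value_counts())
```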
Angina
Angina is a type of chest pain or discomfort caused by reduced blood flow to the heart muscle. It is often a symptom of coronary artery disease and can feel like pressure, squeezing, or tightness in the chest.
Cholesterol
Cholesterol is a waxy, fat-like substance found in the blood. While the body needs cholesterol to build healthy cells, high levels can lead to the development of fatty deposits in blood vessels, increasing the risk of heart disease and stroke.
ECG (Electrocardiogram)
An ECG is a medical test that measures the electrical activity of the heart. It is used to detect heart conditions by recording the heart’s rhythm and electrical signals, helping identify abnormalities like arrhythmias or ischemia.
ST Depression
ST depression refers to a downward shift of the ST segment on an ECG, which may indicate myocardial ischemia (reduced blood flow to the heart) or other conditions like electrolyte imbalances or cardiac strain.
Thalassemia
Thalassemia is an inherited blood disorder characterized by the body’s inability to produce adequate hemoglobin, the protein in red blood cells that carries oxygen. This condition can lead to anemia and other complications, varying in severity depending on the type.
The entire model training and selection process is documented in the Jupyter Notebook located HERE.
In this notebook:
- The dataset was loaded from a CSV file into a Pandas DataFrame.
- Data cleaning and preprocessing steps were performed to handle missing values and prepare the data for modeling.
- Exploratory Data Analysis (EDA) was conducted to gain insights into the data distribution and relationships between features.
- Four different machine learning models (Logistic Regression, Decision Tree, Random Forest, and XGBoost) were trained and hyperparameters were tuned.
- Each model was evaluated using the ROC AUC Score, which measures the ability to distinguish between patients with and without CAD.
The best-performing model, based on the ROC AUC Score, was selected for deployment.
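The selection step described above can be sketched as follows. This is a minimal sketch, not the notebook's actual code: synthetic data stands in for the prepared features, only two of the four models are shown to keep it short, and the real hyperparameters come from the tuning in the notebook.

```python
# Minimal sketch of model comparison by ROC AUC, on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=13, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)[:, 1]   # predicted probability of CAD
    scores[name] = roc_auc_score(y_val, proba)  # discrimination ability

best = max(scores, key=scores.get)             # model chosen for deployment
print(best, scores[best])
```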
data - Folder with dataset files.
- heart.csv - Raw heart patient data.
images - Folder with visual assets for the project.
- heart.jpg - Image of a heart used in documentation or visualization.
- heart_attack_model_diagram.drawio.svg - Diagram visualizing the machine learning model pipeline.
models - Folder for storing trained model artifacts.
- model.bin - Serialized machine learning model (created during training).
notebooks - Folder containing Jupyter notebooks.
- notebook.ipynb - Main notebook for EDA, data preparation, and model training.
src - Folder with Python scripts for core functionality.
- predict.py - Script to serve the model for predictions.
- train.py - Script to train the machine learning model.
tests - Folder containing test scripts.
- predict_test.py - Unit test for the prediction functionality.
Project Root Files
- Pipfile - Dependency file for the project environment.
- Pipfile.lock - Locked dependency versions.
- README.md - Documentation file for project description.
- Dockerfile - Dockerfile for containerizing the application.
- .dockerignore - File specifying files to ignore during Docker build.
- .gitignore - Specifies files to ignore in version control.
NOTE: During the Zoomcamp evaluation process, the project will already be live on the cloud. Therefore, after activating the environment, you can skip to step 7 to use the API and ignore deployment steps 5 and 6.
Necessary tools for running the project: Python 3.11, Pipenv, and Docker.
- Installing Python 3.11: Download Python 3.11 from the official website. If Ubuntu already has a lower version installed, run:
```bash
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt-get install python3.11
```
- Installing Pipenv:
```bash
pip install pipenv
```
- Installing Docker:
```bash
sudo apt-get install docker.io
```
To use Docker without sudo:
```bash
sudo groupadd docker
sudo gpasswd -a $USER docker
sudo service docker restart
```
- Clone the repository.
```bash
git clone https://github.com/dieegogutierrez/heart-attack-model.git
```
- Move to the project directory.
```bash
cd heart-attack-model
```
- Create the environment.
```bash
pipenv install
```
- Activate the environment.
```bash
pipenv shell
```
- Deploy the model locally with Docker.
```bash
docker build -t heart-attack-prediction .
docker run -it -p 9696:9696 heart-attack-prediction:latest
```
- Deploy the model on AWS cloud Elastic Beanstalk.
- Generate Access Keys: click on your user name at the top-right corner > Security Credentials > Access Keys.
- Initialize the Docker application on Elastic Beanstalk. A prompt to enter the Access Key ID and Secret Access Key will appear.
```bash
eb init -p docker heart-attack-prediction
eb create heart-attack-prediction-env --enable-spot
```
- If prompted with an error, use the -i flag and go step by step:
```bash
eb init -i
```
- A URL will be shown at the end. Update the file predict_test.py with that URL in order to test your API serving.
- Terminate the cloud environment after use.
```bash
eb terminate heart-attack-prediction-env
```
- Test the API.
```bash
cd tests
python predict_test.py
```
- Update the file predict_test.py with different patient information to test other cases.
- Try different hyperparameter tuning strategies.
- Use different models.
- Use machine learning to impute missing values instead of deleting them.
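The last idea above could, for example, use scikit-learn's IterativeImputer, which predicts each missing entry from the other columns instead of dropping the row. A minimal sketch with made-up values for three of the dataset's numeric columns (age, trtbps, chol):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Illustrative rows: [age, trtbps, chol], with two missing entries.
X = np.array([
    [63.0, 145.0, 233.0],
    [57.0, np.nan, 354.0],   # missing resting blood pressure
    [52.0, 130.0, np.nan],   # missing cholesterol
    [48.0, 138.0, 275.0],
])

imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)  # missing entries predicted, rows kept
print(X_filled)
```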
Acknowledgement to DataTalksClub for mentoring us through the Machine Learning Zoomcamp. It has been a privilege to take part in the '24 Cohort, go and check them out!
"DataTalks.Club - the place to talk about data! We are a community of people who are passionate about data. Join us to talk about everything related to data, to learn more about applied machine learning with our free courses and materials, to discuss the engineering aspects of data science and analytics, to chat about career options and learn tips and tricks for the job interviews, to discover new things and have fun!
Our weekly events include:
👨🏼💻 Free courses and weekly study groups where you can start practicing within a friendly community of learners
🔧 Workshops where you can get hands-on tutorials about technical topics
⚙️ Open-Source Spotlight, where you can discover open-source tools with a short demo video
🎙 Live Podcasts with practitioners where they share their experience (and the recordings too)
📺 Webinars with slides, where we discuss technical aspects of data science"
