This project contains a complete machine learning pipeline that trains a Random Forest classifier to predict loan actions based on mortgage/loan application data.
- random_forest_model.py - Main script that loads data, preprocesses it, trains the Random Forest model, and evaluates performance
- model_analysis.py - Analysis script that loads the trained model and provides additional insights
- requirements.txt - Python package dependencies
- random_forest_model.pkl - Saved trained Random Forest model
- label_encoders.pkl - Saved label encoders for categorical variables
- feature_importance.png - Feature importance plot (if matplotlib display is available)
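As a rough illustration of how the saved artifacts might be reused, the sketch below writes a small model to a `.pkl` file and reads it back. It assumes the files are written with Python's standard `pickle` module; the project may use `joblib` instead. The synthetic data and the temporary directory are stand-ins, not part of the project.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: the real model is trained on TrainingSet.csv).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "random_forest_model.pkl"

    # Save the fitted model.
    with open(path, "wb") as fh:
        pickle.dump(model, fh)

    # Load it back, as an analysis script might do.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# The restored model should reproduce the original predictions.
same = bool((restored.predict(X) == model.predict(X)).all())
print("Predictions match after reload:", same)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.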
The project uses three CSV files:
- TrainingSet.csv (60,000 samples) - Used to train the model
- TestSet.csv (20,000 samples) - Used to test model performance
- ValidationSet.csv (20,000 samples) - Used for additional validation
- Code 1: Loan originated (50.9% of test data)
- Code 2: Application approved but not accepted (1.2%)
- Code 3: Application denied (20.8%)
- Code 4: Application withdrawn by applicant (12.1%)
- Code 5: File closed for incompleteness (1.1%)
- Code 6: Purchased loan (13.9%)
- Code 8: Preapproval request denied
- Test Accuracy: 98.15%
- Validation Accuracy: 98.07%
- Cross-validation Accuracy: 98.06% (±0.22%)
- hoepa_status (13.7% importance) - High-cost mortgage indicator
- denial_reason_1 (8.1% importance) - Primary reason for denial
- initially_payable_to_institution (6.5% importance) - Institution payment indicator
- interest_rate (5.4% importance) - Loan interest rate
- applicant_credit_score_type (5.4% importance) - Type of credit score used

pip install -r requirements.txt
python random_forest_model.py
python model_analysis.py

- Number of trees: 100
- Maximum depth: 10 (to prevent overfitting)
- Minimum samples to split: 5
- Minimum samples in leaf: 2
- Features: 98 (after preprocessing)
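
The settings listed above map directly onto scikit-learn's `RandomForestClassifier` parameters. The following is a minimal sketch, not the project's actual code: the synthetic data, the `random_state` value, and `n_jobs=-1` are assumptions added for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed loan data (assumption: the real
# pipeline feeds 98 encoded features from TrainingSet.csv).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=8, n_classes=3, random_state=42
)

model = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=10,          # cap tree depth to limit overfitting
    min_samples_split=5,   # minimum samples required to split a node
    min_samples_leaf=2,    # minimum samples required in a leaf
    random_state=42,       # assumption: fixed seed for reproducibility
    n_jobs=-1,             # assumption: use all available cores
)
model.fit(X, y)

deepest = max(tree.get_depth() for tree in model.estimators_)
print(f"Trees: {len(model.estimators_)}, deepest tree: {deepest}")
```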
- Missing Value Handling: Median for numeric, mode for categorical
- Categorical Encoding: Label encoding for all categorical variables
- Feature Engineering: Automatic detection of numeric vs categorical columns
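
The preprocessing steps above can be sketched roughly as follows. The `preprocess` helper and the `loan_type` column are hypothetical and only illustrate the technique: numeric columns are detected by dtype and filled with the median, everything else is filled with the mode and label-encoded.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def preprocess(df):
    """Impute missing values and label-encode non-numeric columns (hypothetical helper)."""
    df = df.copy()
    encoders = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric column: fill gaps with the median.
            df[col] = df[col].fillna(df[col].median())
        else:
            # Categorical column: fill gaps with the most frequent value,
            # then map each category to an integer code.
            df[col] = df[col].fillna(df[col].mode().iloc[0])
            encoder = LabelEncoder()
            df[col] = encoder.fit_transform(df[col].astype(str))
            encoders[col] = encoder
    return df, encoders


# Toy data; "loan_type" is a made-up column for illustration.
toy = pd.DataFrame({
    "interest_rate": [3.5, None, 4.5, 5.0],
    "loan_type": ["A", "B", None, "A"],
})
clean, encoders = preprocess(toy)
print(clean)
```

In a real pipeline the medians, modes, and encoders would normally be fitted on the training set only and then reused on the test and validation sets, which is presumably why the encoders are saved to label_encoders.pkl.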
- Classification report with precision, recall, and F1-scores
- Confusion matrix analysis
- 5-fold cross-validation
- Feature importance ranking
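
The four evaluation steps above correspond to standard scikit-learn calls. The sketch below runs them on synthetic data, so its numbers are unrelated to the results reported in this README; the dataset, split, and seed are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption: the project uses its own CSV splits).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall, and F1-score.
print(classification_report(y_test, y_pred))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# 5-fold cross-validation on the training data.
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.4f} (std {scores.std():.4f})")

# Feature indices sorted by importance, highest first.
ranking = sorted(
    enumerate(model.feature_importances_), key=lambda pair: pair[1], reverse=True
)
print("Top 5 features:", ranking[:5])
```

With imbalanced classes, the per-class rows of the classification report are usually more informative than overall accuracy.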
- High Performance: Accuracy is above 98% on the test set, on the validation set, and in 5-fold cross-validation
- Feature Importance: HOEPA status and denial reasons are the most predictive features
- Class Imbalance: Some action types (like Code 5) have fewer samples and are harder to predict
- Robust Model: Accuracy differs by less than 0.1 percentage points between the test set, the validation set, and cross-validation
To improve the model further, consider:
- Hyperparameter tuning using GridSearchCV or RandomizedSearchCV
- Handling class imbalance with techniques like SMOTE or class weights
- Feature selection to reduce dimensionality
- Ensemble methods combining multiple algorithms
- Deep learning approaches for complex feature interactions
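
As one possible starting point for the first two suggestions, the sketch below combines a small `GridSearchCV` with `class_weight="balanced"`. The parameter grid, the macro-F1 scoring choice, and the synthetic imbalanced data are all assumptions chosen to keep the example fast; they are not tuned for this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data: the third class covers roughly 5% of samples.
X, y = make_classification(
    n_samples=400,
    n_features=12,
    n_informative=6,
    n_classes=3,
    weights=[0.7, 0.25, 0.05],
    random_state=1,
)

search = GridSearchCV(
    # class_weight="balanced" reweights classes inversely to their frequency.
    RandomForestClassifier(n_estimators=30, class_weight="balanced", random_state=1),
    param_grid={"max_depth": [5, 10], "min_samples_leaf": [1, 2]},
    scoring="f1_macro",  # macro-F1 weights every class equally
    cv=3,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best macro-F1: {search.best_score_:.3f}")
```

`RandomizedSearchCV` accepts a similar set of arguments and tends to scale better when the grid is large.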