attentiondotnet/train_fandom_forest

Random Forest Model for Loan Action Prediction

This project contains a complete machine learning pipeline that trains a Random Forest classifier to predict loan actions based on mortgage/loan application data.

Files Created

Main Scripts

  • random_forest_model.py - Main script that loads data, preprocesses it, trains the Random Forest model, and evaluates performance
  • model_analysis.py - Analysis script that loads the trained model and provides additional insights
  • requirements.txt - Python package dependencies

Generated Files (after running)

  • random_forest_model.pkl - Saved trained Random Forest model
  • label_encoders.pkl - Saved label encoders for categorical variables
  • feature_importance.png - Feature importance plot (if matplotlib display is available)
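As a rough illustration of how a file like random_forest_model.pkl could be written and read back, here is a minimal sketch using Python's pickle module. It is an assumption that the scripts use pickle (joblib is another common choice), and the synthetic data and temporary directory are placeholders rather than the project's own setup.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the real script trains on TrainingSet.csv
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "random_forest_model.pkl"
    # Save the fitted model to disk
    with model_path.open("wb") as fh:
        pickle.dump(model, fh)
    # Load it back, as an analysis script might do
    with model_path.open("rb") as fh:
        restored = pickle.load(fh)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and they are generally tied to the scikit-learn version that produced them.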

Dataset Structure

The project uses three CSV files:

  • TrainingSet.csv (60,000 samples) - Used to train the model
  • TestSet.csv (20,000 samples) - Used to test model performance
  • ValidationSet.csv (20,000 samples) - Used for additional validation

Target Variable: action_taken

  • Code 1: Loan originated (50.9% of test data)
  • Code 2: Application approved but not accepted (1.2%)
  • Code 3: Application denied (20.8%)
  • Code 4: Application withdrawn by applicant (12.1%)
  • Code 5: File closed for incompleteness (1.1%)
  • Code 6: Purchased loan (13.9%)
  • Code 8: Preapproval request approved but not accepted (in the HMDA coding scheme, "Preapproval request denied" is Code 7)
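The class shares listed above can be recomputed from any of the CSV files with a few lines of pandas. The sketch below uses a tiny made-up frame in place of TestSet.csv, and the helper name class_shares is hypothetical.

```python
import pandas as pd


def class_shares(df: pd.DataFrame, target: str = "action_taken") -> pd.Series:
    """Return the percentage of rows per target code, sorted by code."""
    shares = df[target].value_counts(normalize=True).sort_index() * 100
    return shares.round(1)


# Tiny made-up sample; with the real data this would be pd.read_csv("TestSet.csv")
demo = pd.DataFrame({"action_taken": [1, 1, 1, 3, 3, 4, 6, 1, 2, 1]})
shares = class_shares(demo)
print(shares)
```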

Model Performance

Accuracy Scores

  • Test Accuracy: 98.15%
  • Validation Accuracy: 98.07%
  • Cross-validation Accuracy: 98.06% (±0.22%)
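A cross-validation figure in the "mean (±spread)" form can be produced with scikit-learn's cross_val_score. This sketch assumes the ± value is the standard deviation of the fold scores, which the README does not state, and it runs on synthetic data with a smaller forest to stay fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
model = RandomForestClassifier(n_estimators=20, random_state=0)

# One accuracy score per fold (5 folds)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

# Report mean accuracy with the fold-to-fold standard deviation
summary = f"{scores.mean():.2%} (±{scores.std():.2%})"
print(summary)
```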

Top Important Features

  1. hoepa_status (13.7% importance) - High-cost mortgage indicator
  2. denial_reason_1 (8.1% importance) - Primary reason for denial
  3. initially_payable_to_institution (6.5% importance) - Institution payment indicator
  4. interest_rate (5.4% importance) - Loan interest rate
  5. applicant_credit_score_type (5.4% importance) - Type of credit score used
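A ranking like the one above can be read from a fitted forest's feature_importances_ attribute. The sketch below uses synthetic data and placeholder feature names, so the numbers it prints are unrelated to the values reported here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder column names
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort from most to least important
ranking = pd.Series(model.feature_importances_, index=feature_names)
ranking = ranking.sort_values(ascending=False)

# Show the top five as percentages
print((ranking.head(5) * 100).round(1))
```

Impurity-based importances can favor features with many distinct values, so permutation importance is a useful cross-check.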

How to Run

1. Install Dependencies

pip install -r requirements.txt

2. Train and Test the Model

python random_forest_model.py

3. Analyze Results

python model_analysis.py

Model Details

Random Forest Configuration

  • Number of trees: 100
  • Maximum depth: 10 (to prevent overfitting)
  • Minimum samples to split: 5
  • Minimum samples in leaf: 2
  • Features: 98 (after preprocessing)
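The configuration above maps directly onto RandomForestClassifier parameters. In this sketch the random_state value and the synthetic data are assumptions added for reproducibility; the four tree settings are the ones listed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=10,          # cap tree depth to limit overfitting
    min_samples_split=5,   # minimum samples required to split a node
    min_samples_leaf=2,    # minimum samples required in a leaf
    random_state=42,       # assumption: fixed seed for reproducibility
)

# Synthetic stand-in for the 98 preprocessed features
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=8, random_state=42
)
model.fit(X, y)

# Confirm that no tree grew deeper than the configured limit
deepest = max(tree.get_depth() for tree in model.estimators_)
print(len(model.estimators_), deepest)
```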

Data Preprocessing

  • Missing Value Handling: Median for numeric, mode for categorical
  • Categorical Encoding: Label encoding for all categorical variables
  • Column Type Detection: Numeric and categorical columns are detected automatically
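The preprocessing steps above could look roughly like the following. This is a sketch under assumptions, not the script's actual code: the preprocess helper and the loan_type column are hypothetical, and the dtype check is one possible way to separate numeric from categorical columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def preprocess(df: pd.DataFrame):
    """Impute missing values and label-encode categorical columns."""
    df = df.copy()
    encoders = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric column: fill gaps with the median
            df[col] = df[col].fillna(df[col].median())
        else:
            # Categorical column: fill gaps with the mode, then encode as integers
            df[col] = df[col].fillna(df[col].mode().iloc[0])
            encoder = LabelEncoder()
            df[col] = encoder.fit_transform(df[col].astype(str))
            encoders[col] = encoder
    return df, encoders


# Small made-up frame; loan_type is a hypothetical column name
raw = pd.DataFrame({
    "interest_rate": [3.5, None, 4.5, 5.5],
    "loan_type": ["A", "B", None, "B"],
})
clean, encoders = preprocess(raw)
print(clean)
```

In practice the medians, modes, and encoders would be fitted on the training set only and then reused for the test and validation sets, which is presumably why label_encoders.pkl is saved.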

Model Evaluation

  • Classification report with precision, recall, and F1-scores
  • Confusion matrix analysis
  • 5-fold cross-validation
  • Feature importance ranking
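The report and confusion matrix listed above are available from sklearn.metrics. The following sketch runs on a synthetic three-class problem, so its output only illustrates the shape of the results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the loan dataset
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_test, y_pred)

# Per-class precision, recall, and F1-score
print(classification_report(y_test, y_pred, zero_division=0))
print(matrix)
```

The diagonal of the confusion matrix counts correct predictions, so its sum divided by the total equals the accuracy.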

Key Insights

  1. High Performance: The model reaches roughly 98% accuracy on the test set (98.15%), on the validation set (98.07%), and in 5-fold cross-validation (98.06%)
  2. Feature Importance: hoepa_status and denial_reason_1 are the two most important features (13.7% and 8.1% importance)
  3. Class Imbalance: Some action types (like Code 5, at 1.1% of the test data) have fewer samples and are harder to predict
  4. Robust Model: Accuracy is consistent across the test set, the validation set, and cross-validation

Next Steps

To improve the model further, consider:

  • Hyperparameter tuning using GridSearchCV or RandomizedSearchCV
  • Handling class imbalance with techniques like SMOTE or class weights
  • Feature selection to reduce dimensionality
  • Ensemble methods combining multiple algorithms
  • Deep learning approaches for complex feature interactions
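Two of these suggestions, class weights and randomized hyperparameter search, can be combined in a few lines. The search space, the macro-F1 scoring choice, and the synthetic imbalanced data below are illustrative assumptions rather than recommendations for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced data: roughly 90% of one class, 10% of the other
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Illustrative search space; real ranges would need tuning
param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    # class_weight="balanced" reweights classes inversely to their frequency
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions,
    n_iter=4,
    cv=3,
    scoring="f1_macro",  # macro-F1 weighs rare classes equally with common ones
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

With rare classes such as Code 2 and Code 5, a macro-averaged metric is often more informative than plain accuracy when comparing candidates.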
