This project follows the complete machine learning life cycle to solve a real-world classification problem using the 1994 Census Income Dataset. The objective is to predict whether an individual's income exceeds $50K/year based on demographic and work-related attributes.
This notebook includes:
- Exploratory data analysis (EDA)
- Data cleaning and preprocessing
- Feature engineering and selection
- Model training and evaluation using:
- Logistic Regression
- Random Forest
- Neural Networks
File: censusData.csv
Source: UCI Machine Learning Repository (modified for educational use)
- Type: Supervised Learning
- Category: Binary Classification
- Target Variable:
income_binary(0 = β€50K, 1 = >50K) - Features:
- Continuous:
age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week - Categorical (one-hot encoded):
workclass,education,marital-status,occupation,relationship,race,sex_selfID,native-country
- Continuous:
- Imputed missing values using column means (for
ageandhours-per-week) or dropped incomplete rows - Encoded the target variable using
LabelEncoder - Applied one-hot encoding to all categorical features
- Standardized features using
StandardScaler
- Used Random Forest to extract the top 50 most important features to reduce noise
- 3 hidden layers with ReLU activation
- Batch Normalization and Dropout used for regularization
- Optimizers: Tested with both
SGDandAdam - Trained for 115 epochs with early insight logging every 5 epochs
Performance:
- Accuracy on test set: ~83.9%
- Good generalization with training vs. validation accuracy curves closely aligned
- Implemented with
sklearn.linear_model.LogisticRegression - Achieved comparable performance to neural network
Performance:
- Accuracy on test set: ~84.2%
- Simple, fast to train, and interpretable
- Metric used: Accuracy
- Also evaluated predicted probabilities vs. actual classes
- Plotted training/validation loss and accuracy over epochs
- Compared model complexity vs. performance for fairness
- Model Selection matters: simple models like Logistic Regression can match the performance of more complex ones for some datasets.
- Feature Engineering and Preprocessing are critical: handling missing data and one-hot encoding were essential to model success.
- Bias/Fairness Consideration: This problem sheds light on how income inequality correlates with social and demographic features, which can have ethical implications in real-world deployments.
- Python 3
- Pandas, NumPy
- Scikit-learn
- TensorFlow / Keras
- Matplotlib, Seaborn
This notebook was completed as Lab 8 of the Machine Learning curriculum. The lab required defining a predictive task, developing a complete ML pipeline, and reflecting on model performance and fairness. The Census dataset was chosen for its rich demographic insights and relevance to social equity research.