This project develops a machine learning model using Logistic Regression to predict loan approval status for applicants.
The dataset contains historical loan applications with features such as income, employment type, number of dependents, credit score, and asset values.
The primary objective is to provide a data-driven decision support tool for banks and financial institutions to assess loan risk efficiently.
- Dataset (loan.csv) loaded using Pandas.
- Checked for missing values and cleaned column names for consistency.
- Selected features: no_of_dependents, education, self_employed, income_annum, loan_amount, loan_term, cibil_score, residential_assets_value, commercial_assets_value, luxury_assets_value, bank_asset_value
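The loading and cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the three-row sample stands in for loan.csv, all of its values are made up, and the stray spaces in the header are an assumption about what "cleaned column names" had to fix. `SAMPLE_CSV`, `load_and_clean`, `FEATURES`, and `TARGET` are hypothetical names.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for loan.csv; every value is made up.
# The header deliberately carries stray spaces to show the cleanup step.
SAMPLE_CSV = (
    "loan_id, no_of_dependents, education, self_employed, income_annum,"
    " loan_amount, loan_term, cibil_score, residential_assets_value,"
    " commercial_assets_value, luxury_assets_value, bank_asset_value,"
    " loan_status\n"
    "1,2,Graduate,No,500,1500,12,750,100,200,300,400,Approved\n"
    "2,0,Not Graduate,Yes,300,900,8,420,50,60,70,80,Rejected\n"
    "3,1,Graduate,No,400,1200,10,,90,110,130,150,Approved\n"
)

FEATURES = [
    "no_of_dependents", "education", "self_employed", "income_annum",
    "loan_amount", "loan_term", "cibil_score", "residential_assets_value",
    "commercial_assets_value", "luxury_assets_value", "bank_asset_value",
]
TARGET = "loan_status"


def load_and_clean(source):
    """Read the CSV and normalise column names (strip spaces, lower-case)."""
    df = pd.read_csv(source)
    df.columns = df.columns.str.strip().str.lower()
    return df


df = load_and_clean(io.StringIO(SAMPLE_CSV))

# Keep only the selected features plus the target; loan_id is dropped.
data = df[FEATURES + [TARGET]]

# Count missing values per column before deciding how to handle them.
missing_per_column = data.isnull().sum()
print(missing_per_column[missing_per_column > 0])
```

With the real file, `load_and_clean("loan.csv")` would replace the in-memory sample.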
- Applied Label Encoding for the target variable (loan_status).
- Applied One-Hot Encoding for categorical features (education, self_employed).
- Dropped rows with missing values to ensure data quality.
- Ensured train and test sets have matching columns after encoding.
- Data split into 80% training and 20% testing using train_test_split.
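One possible ordering of the preprocessing steps above is sketched below. It is an assumption, not the notebook's confirmed sequence: rows with missing values are dropped first, the target is label-encoded, the data is split, and the one-hot encoding is then applied to each split, with the test columns reindexed to match the training columns. The toy frame uses only a few of the real columns with made-up values, and `random_state=42` is an arbitrary choice for reproducibility.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame (made-up values, subset of the real columns).
raw = pd.DataFrame({
    "education": ["Graduate", "Not Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Graduate", "Not Graduate", "Not Graduate",
                  "Graduate", "Not Graduate", "Graduate"],
    "self_employed": ["No", "Yes", "No", "Yes", "No", "No",
                      "Yes", "No", "Yes", "No", "Yes"],
    "cibil_score": [750, 420, 680, None, 510, 800, 390, 700, 640, 450, 720],
    "loan_status": ["Approved", "Rejected", "Approved", "Rejected",
                    "Rejected", "Approved", "Rejected", "Approved",
                    "Approved", "Rejected", "Approved"],
})

# Drop rows with any missing value (11 rows -> 10 rows here).
clean = raw.dropna().copy()

# Label-encode the target; classes are sorted, so Approved -> 0, Rejected -> 1.
encoder = LabelEncoder()
clean["loan_status"] = encoder.fit_transform(clean["loan_status"])

X = clean.drop(columns="loan_status")
y = clean["loan_status"]

# 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One-hot encode the categorical features on each split.
categorical = ["education", "self_employed"]
X_train = pd.get_dummies(X_train, columns=categorical, dtype=int)
X_test = pd.get_dummies(X_test, columns=categorical, dtype=int)

# A small test split may lack a category, so align it to the training columns.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```

Encoding the full frame before splitting would also give matching columns; the reindex step only matters when each split is encoded separately.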
- Implemented Logistic Regression (max_iter=1000) for binary classification.
- Trained on the encoded feature set.
- Evaluated model performance using accuracy score on test data.
- Visualized results using a Confusion Matrix to show true positives, true negatives, false positives, and false negatives.
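The training and evaluation steps above can be sketched as follows. To keep the example self-contained it runs on synthetic data from `make_classification` rather than the loan dataset, so the numbers it prints say nothing about the reported results; only `max_iter=1000`, the 80/20 split, the accuracy score, and the confusion matrix come from this README, and the seeds are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the encoded loan features.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same estimator settings as described above.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Rows are true labels, columns are predicted labels.
# For labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(f"Accuracy: {accuracy:.3f}")
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

For a plot rather than raw counts, scikit-learn's `ConfusionMatrixDisplay.from_predictions(y_test, y_pred)` draws the same matrix with Matplotlib. Which class counts as "positive" depends on how the label encoder ordered the target values.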
✅ Accuracy on Test Data: 83.1%
✅ Confusion Matrix highlights the distribution of true positives, true negatives, false positives, and false negatives.
✅ Logistic Regression with the selected features provides a solid baseline for predicting loan approval status.
- Programming Language: Python
- Libraries: Pandas, NumPy, Matplotlib, Scikit-Learn
- Environment: Jupyter Notebook / Google Colab
- Banks & Financial Institutions → Automate risk assessment and loan approval processes.
- Loan Officers → Reduce manual evaluation time and improve decision consistency.
- Data Analysts → Identify key features influencing loan approval.
- Customers → Understand factors that improve loan approval chances.
Although the model achieves solid accuracy, some limitations exist:
- Dataset Size & Quality: Limited dataset size or quality may reduce generalizability.
- Feature Limitations: Excludes some financial or behavioral factors (e.g., applicant history, co-applicants).
- Model Choice: Logistic Regression is interpretable but may be outperformed by more complex models.
🔮 Planned Improvements:
- Incorporate additional features (e.g., applicant history, co-applicants, collateral).
- Perform hyperparameter tuning with GridSearchCV.
- Explore advanced models like Random Forest or XGBoost for better performance.
- Deploy as an interactive web application (e.g., Streamlit/Django) for real-time loan predictions.
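As a rough idea of what the planned tuning step could look like, the sketch below searches over the regularization strength `C` of Logistic Regression with 5-fold cross-validation. The grid values, `cv=5`, and the synthetic data are all illustrative assumptions; none of them come from the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the encoded training set.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Hypothetical grid: smaller C means stronger regularization.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` is a model refit on all the supplied data with the best parameters, so it can be evaluated on the held-out test set directly.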
- File: loan.csv
- Size: Not specified
- Columns: loan_id, no_of_dependents, education, self_employed, income_annum, loan_amount, loan_term, cibil_score, residential_assets_value, commercial_assets_value, luxury_assets_value, bank_asset_value, loan_status
Daud Ibrahim Hassan
📌 Data Analyst & Computer Science Student (BRAC University)
🔗 LinkedIn | GitHub