CSE6748 - Applied Analytics Practicum
Georgia Institute of Technology | Fall 2025
- Overview
- Key Features
- Model Performance
- Technology Stack
- Quick Start
- Installation Guide
- Usage Examples
- Project Structure
- Model Details
- API Documentation
- Screenshots
- Team
- Acknowledgments
A production-ready web application that uses a Random Forest model to predict early-stage construction costs. Built for Construction Cost Database LLC, this tool provides Class 5 estimates (±25% accuracy) for infrastructure projects.
Target requirement: MAPE < 25%. Actual performance: MAPE = 21.97%. Result: target beaten by 3.03 percentage points.
- Course: CSE6748 - Applied Analytics Practicum
- Institution: Georgia Institute of Technology
- Semester: Fall 2025
- Client: Construction Cost Database LLC
- Dataset: 17,025 historical projects (2010-2025)
- Real-time exploratory data analysis with interactive visualizations
- Instant ML-powered cost estimates with confidence intervals
- Comprehensive model performance tracking and comparison
| Metric | Value | Status |
|---|---|---|
| Test MAPE | 21.97% | Target Met |
| R² Score | 0.9463 | Excellent |
| Test MAE | $271,543 | Strong |
| Test RMSE | $412,583 | Reliable |
| Dataset Size | 17,025 projects | Large Scale |
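For reference, the four metrics in the table follow their standard definitions. The sketch below computes them in plain Python on made-up numbers; the function name `regression_metrics` and the sample costs are illustrative assumptions, not taken from the project's code.

```python
import math

def regression_metrics(actual, predicted):
    """Return MAPE (%), MAE, RMSE and R^2 for two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    # MAPE: mean of |error| / |actual|, expressed as a percentage.
    mape = 100.0 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # R^2: 1 - (residual sum of squares / total sum of squares).
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return {"mape": mape, "mae": mae, "rmse": rmse, "r2": r2}

# Hypothetical project costs in dollars, for illustration only.
actual = [100_000.0, 200_000.0, 400_000.0]
predicted = [110_000.0, 180_000.0, 400_000.0]
metrics = regression_metrics(actual, predicted)
meets_target = metrics["mape"] < 25.0  # the Class 5 target stated above
```

MAPE is undefined when an actual cost is zero, so a real evaluation would need to filter or handle such rows first.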
| Model | CV MAPE | Test MAPE | R² Score | Status |
|---|---|---|---|---|
| Random Forest | 23.30% | 21.97% | 0.9463 | Deployed |
| XGBoost | 36.16% | 36.37% | 0.9258 | Alternative |
| LightGBM | 36.93% | 37.62% | 0.9232 | Alternative |
| Gradient Boosting | 42.48% | 43.75% | 0.9015 | |
Why Random Forest?
- Best MAPE performance (21.97%)
- Highest R² score (0.9463)
- Excellent interpretability
- Robust to outliers
- Fast training and prediction
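The CV MAPE column in the comparison table comes from 5-fold cross-validation. As a rough illustration of the idea (not the project's actual training code), the sketch below deals shuffled indices into five folds and scores a trivial "predict the training median" baseline on each held-out fold. The names `k_fold_indices` and `cross_validated_mape` and the toy costs are assumptions made for this example; in a real run the median would be replaced by a fitted model.

```python
import random
import statistics

def k_fold_indices(n_samples, k=5, seed=42):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_mape(costs, k=5, seed=42):
    """Mean MAPE (%) of a training-median baseline across k held-out folds."""
    fold_scores = []
    for held_out in k_fold_indices(len(costs), k, seed):
        held_out_set = set(held_out)
        # Everything outside the held-out fold is training data.
        train = [c for i, c in enumerate(costs) if i not in held_out_set]
        prediction = statistics.median(train)
        errors = [abs(prediction - costs[i]) / costs[i] for i in held_out]
        fold_scores.append(100.0 * sum(errors) / len(errors))
    return sum(fold_scores) / k

# Toy costs (20 values), for illustration only.
toy_costs = [float(c) for c in range(100_000, 2_100_000, 100_000)]
folds = k_fold_indices(len(toy_costs))
baseline_cv_mape = cross_validated_mape(toy_costs)
```

Each sample lands in exactly one held-out fold, so every project is scored once by a model that never saw it during training.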
- Python 3.11+ - Core programming language
- Flask 3.0.0 - Web framework
- scikit-learn 1.3.2 - Machine learning
- Pandas 2.1.4 - Data manipulation
- NumPy 1.26.2 - Numerical computing
- Matplotlib 3.8.2 - Visualizations
- Seaborn 0.13.0 - Statistical plots

- HTML5 / CSS3 - Modern web standards
- Bootstrap 5 - Responsive UI framework
- Chart.js - Interactive charts
- Vanilla JavaScript - Dynamic interactions

- Random Forest - Primary algorithm
- StandardScaler - Feature normalization
- OneHotEncoder - Categorical encoding
- K-Means Clustering - Geographic regions
- 5-Fold CV - Model validation
- Python 3.11 or higher installed
- pip package manager available
- 500MB free disk space
- Modern web browser
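Most of these prerequisites can be checked from Python itself before installing anything. The script below is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper and is not part of this repository.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)
MIN_FREE_BYTES = 500 * 1024 * 1024  # 500 MB

def check_prerequisites(path="."):
    """Return a dict mapping each prerequisite name to True or False."""
    free_bytes = shutil.disk_usage(path).free
    pip_found = shutil.which("pip") is not None or shutil.which("pip3") is not None
    return {
        "python_version": sys.version_info[:2] >= MIN_PYTHON,
        "pip_available": pip_found,
        "disk_space": free_bytes >= MIN_FREE_BYTES,
    }

if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```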
# Clone, setup, and run in one go
git clone https://github.com/kraryal/construction_new.git && \
cd construction_new && \
python -m venv venv && \
source venv/bin/activate && \
pip install -r requirements.txt && \
python app.py

Windows users: replace source venv/bin/activate with venv\Scripts\activate

Open your browser at http://localhost:5000

That's it!
git clone https://github.com/kraryal/construction_new.git
cd construction_new

Windows:
python -m venv venv
venv\Scripts\Activate.ps1

Mac/Linux:
python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip
pip install -r requirements.txt

python -c "import flask, pandas, sklearn; print('Setup complete!')"

# Ensure dataset is in correct location
ls data/base_data_for_model.csv

python app.py

Expected Output:
================================================================================
ML-BASED CLASS 5 CONSTRUCTION COST ESTIMATOR
================================================================================
✅ Dataset loaded successfully: 17025 projects with 38 features
✅ Model loaded successfully
✅ System Ready!
Access at: http://localhost:5000
================================================================================
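The startup banner reports the dataset's size. To sanity-check the CSV before launching the app, a standard-library sketch like the one below can count its rows and columns. It assumes the file has a single header row; `dataset_shape` is a hypothetical helper, not a function from app.py, and the banner's "features" count may not equal the raw column count.

```python
import csv

def dataset_shape(csv_path):
    """Return (n_rows, n_columns) for a CSV file with one header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            return 0, 0  # empty file
        n_rows = sum(1 for _ in reader)
    return n_rows, len(header)

# Usage, with the path from the installation steps:
# rows, cols = dataset_shape("data/base_data_for_model.csv")
# print(f"{rows} projects, {cols} columns")
```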
- Navigate to the Cost Estimator page
- Fill in the project details form
- Click "Calculate Cost Estimate"
- View the prediction with its confidence interval
Input Example
| Field | Value |
|---|---|
| Project Type | Pavement Markers |
| Budget Range | $3M-$6M |
| Complexity | Category 4 |
| State | Michigan (MI) |
| County | Alcona County |
| Area Type | Rural |
| Inflation Factor | 1.05 |
| ACF | 1.01 |
Output
Estimated Cost: $4,358,432.11
Confidence Range: $3,268,824 - $5,448,040
Similar Projects: 6,147 found
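The confidence range in this example is numerically consistent with a ±25% band around the point estimate, which matches the Class 5 accuracy stated earlier. The sketch below reproduces that arithmetic. Whether app.py derives the interval exactly this way is an assumption, and `class5_range` and `as_dollars` are illustrative names.

```python
def class5_range(estimate, tolerance=0.25):
    """Return (lower, upper) bounds at +/- tolerance around a point estimate."""
    if estimate < 0 or not 0 <= tolerance < 1:
        raise ValueError("estimate must be >= 0 and tolerance in [0, 1)")
    lower = round(estimate * (1 - tolerance), 2)
    upper = round(estimate * (1 + tolerance), 2)
    return lower, upper

def as_dollars(amount):
    """Format a number as US dollars with thousands separators."""
    return f"${amount:,.2f}"

lower, upper = class5_range(4_358_432.11)
print(as_dollars(lower), "-", as_dollars(upper))
# $3,268,824.08 - $5,448,040.14
```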
import requests

# Endpoint
url = 'http://localhost:5000/estimate_cost'

# Project data (sent as form fields)
data = {
    'inflation_factor': 1.05,
    'official_budget_range': '$3M-$6M',
    'ciqs_complexity_category': 'Category 4',
    'cnt_division': 6,
    'cnt_item_code': 6,
    'county_name': 'Alcona County',
    'area_type': 'Rural',
    'acf': 1.01,
    'project_type': 'Pavement Markers',
    'project_category': 'Civil',
    'project_state': 'MI',
    'region': 'Region_3'
}

# Make request
response = requests.post(url, data=data, timeout=30)
response.raise_for_status()  # fail early on HTTP errors
result = response.json()

# Display results
if result['success']:
    ci = result['confidence_interval']
    print(f"Estimated Cost: {result['estimated_cost_formatted']}")
    print(f"Confidence Range: {ci['lower_formatted']} - {ci['upper_formatted']}")
    print(f"Similar Projects: {result['similar_projects']['count']}")

curl -X POST http://localhost:5000/estimate_cost \
  -d "inflation_factor=1.05" \
  -d "official_budget_range=\$3M-\$6M" \
  -d "ciqs_complexity_category=Category 4" \
  -d "cnt_division=6" \
  -d "cnt_item_code=6" \
  -d "county_name=Alcona County" \
  -d "area_type=Rural" \
  -d "acf=1.01" \
  -d "project_type=Pavement Markers" \
  -d "project_category=Civil" \
  -d "project_state=MI" \
  -d "region=Region_3"

construction_new/
│
├── app.py                            # Main Flask application (500+ lines)
├── requirements.txt                  # Python dependencies
├── README.md                         # Project documentation (this file)
├── INSTRUCTIONS.md                   # Detailed setup guide
│
├── data/
│   └── base_data_for_model.csv       # Training dataset (17,025 projects)
│
├── models/
│   ├── construction_cost_model.pkl   # Trained Random Forest model
│   └── model_metrics.json            # Performance metrics
│
├── templates/                        # HTML templates
│   ├── home.html                     # Landing page with cards
│   ├── eda.html                      # Exploratory data analysis
│   ├── model_comparison.html         # Algorithm comparison
│   ├── dashboard.html                # Performance dashboard
│   ├── cost_estimator.html           # Prediction form (main feature)
│   ├── data_overview.html            # Dataset information
│   ├── documentation.html            # API docs & team info
│   ├── base.html                     # Base template with navigation
│   └── error.html                    # Error handling page
│
└── static/
    ├── css/
    │   └── styles.css                # Custom styles
    └── images/                       # Generated plots & assets
- Inflation Factor - Range: 1.00 - 1.34 | Adjusts for year-over-year cost changes
- Area Cost Factor (ACF) - Range: 0.80 - 1.19 | Geographic cost adjustment multiplier
- Project Type - Categorical | Specific construction work type (e.g., Pavement Markers)
- Project Category - Categorical | General classification (e.g., Civil, Water & Sewer)
- CIQS Complexity Category - Category 1-4 | Complexity rating from simple to complex
- Official Budget Range - Categorical | Budget bracket (e.g., $3M-$6M, Less than 1M)
- Project State - 50 US states | Location identifier
- County Name - Varies by state | Specific county location
- Area Type - Urban/Rural | Development density classification
- Region - Region_0 to Region_3 | K-Means clustered geographic zones
- CNT Division Code - Range: 1 - 29 | Construction division taxonomy
- CNT Item Code - Range: 1 - 61 | Specific item classification
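The ranges and categories listed above lend themselves to a client-side sanity check before a request is sent. The sketch below uses the field names from the API examples in this README. Treating the listed ranges as hard limits is an assumption (they may simply describe the training data), and `validate_payload` is a hypothetical helper.

```python
NUMERIC_RANGES = {
    "inflation_factor": (1.00, 1.34),
    "acf": (0.80, 1.19),
    "cnt_division": (1, 29),
    "cnt_item_code": (1, 61),
}
ALLOWED_VALUES = {
    "area_type": {"Urban", "Rural"},
    "region": {f"Region_{i}" for i in range(4)},
    "ciqs_complexity_category": {f"Category {i}" for i in range(1, 5)},
}

def validate_payload(payload):
    """Return a list of problems found; an empty list means the payload passed."""
    problems = []
    for field, (low, high) in NUMERIC_RANGES.items():
        value = payload.get(field)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field}: expected a number")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} is outside {low}-{high}")
    for field, allowed in ALLOWED_VALUES.items():
        if payload.get(field) not in allowed:
            problems.append(f"{field}: {payload.get(field)!r} is not an allowed value")
    return problems

example = {
    "inflation_factor": 1.05, "acf": 1.01, "cnt_division": 6,
    "cnt_item_code": 6, "area_type": "Rural", "region": "Region_3",
    "ciqs_complexity_category": "Category 4",
}
issues = validate_payload(example)  # empty for the README's sample input
```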
graph LR
A[Raw Data<br/>17,025 projects] --> B[Data Cleaning<br/>Fill missing values]
B --> C[Feature Engineering<br/>K-Means clustering]
C --> D[Train/Test Split<br/>80/20]
D --> E[Preprocessing<br/>StandardScaler + OneHotEncoder]
E --> F[Model Training<br/>Random Forest]
F --> G[Validation<br/>5-Fold CV]
G --> H[Production Model<br/>21.97% MAPE]
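The feature-engineering stage of this pipeline groups locations into regions with K-Means. As a rough illustration of the algorithm itself, here is a small pure-Python version of Lloyd's iteration. The coordinates are made up, and starting the centroids at the first k points is a simplification chosen for determinism; scikit-learn's `KMeans` defaults to k-means++ initialisation, and the project's actual clustering inputs are not shown in this README.

```python
import math

def k_means(points, k, iterations=100):
    """Lloyd's algorithm on 2-D points; returns (labels, centroids)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]  # deterministic start
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_labels == labels and new_centroids == centroids:
            break  # converged: nothing changed in this pass
        labels, centroids = new_labels, new_centroids
    return labels, centroids

# Made-up (latitude, longitude) pairs, not real state coordinates.
coords = [
    (44.0, -85.0), (30.0, -98.0), (37.0, -120.0), (41.0, -74.0),
    (44.5, -84.5), (30.5, -97.5), (37.5, -119.5), (41.5, -74.5),
]
labels, centroids = k_means(coords, k=4)
regions = [f"Region_{label}" for label in labels]
```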
- Missing Value Imputation
  - Numerical: Median
  - Categorical: Mode
- Feature Engineering
  - K-Means clustering for geographic regions
  - Created 4 regional clusters from state coordinates
- Feature Scaling
  - StandardScaler for numerical features
  - OneHotEncoder for categorical features
- Train/Test Split
  - 80% training (13,620 samples)
  - 20% testing (3,405 samples)
  - Random state: 42 (reproducible)
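To make the preprocessing steps above concrete, the following sketch re-implements them in plain Python on toy values. It is an illustration of the techniques, not the project's code: the real pipeline uses scikit-learn's `StandardScaler`, `OneHotEncoder`, and an 80/20 split, and Python's `random` with seed 42 will not reproduce scikit-learn's exact shuffle. All function names here are hypothetical.

```python
import random
import statistics

def impute(values, strategy):
    """Fill None entries with the median (numeric) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    else:
        fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Scale to zero mean and unit variance using the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # constant column: leave the scale at 1
    return [(v - mean) / std for v in values]

def one_hot(values):
    """Encode categories as 0/1 columns, one per sorted distinct value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

def train_test_split_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle row indices reproducibly and split them into (train, test)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

acf = impute([1.01, None, 0.95, 1.10], "median")          # None becomes 1.01
area = impute(["Rural", "Urban", None, "Rural"], "mode")  # None becomes "Rural"
encoded, categories = one_hot(area)
train_idx, test_idx = train_test_split_indices(17_025)
```

With 17,025 rows, the split sizes come out to 13,620 and 3,405, matching the counts listed above.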
RandomForestRegressor(
    n_estimators=100,    # Number of decision trees
    random_state=42,     # Reproducibility seed
    n_jobs=-1            # Use all CPU cores
)

POST /estimate_cost

Request Body (Form Data):
{
  "inflation_factor": 1.05,
  "official_budget_range": "$3M-$6M",
  "ciqs_complexity_category": "Category 4",
  "cnt_division": 6,
  "cnt_item_code": 6,
  "county_name": "Alcona County",
  "area_type": "Rural",
  "acf": 1.01,
  "project_type": "Pavement Markers",
  "project_category": "Civil",
  "project_state": "MI",
  "region": "Region_3"
}

Success Response (200):
{
  "success": true,
  "estimated_cost": 4358432.11,
  "estimated_cost_formatted": "$4,358,432.11",
  "confidence_interval": {
    "lower": 3268824.08,
    "upper": 5448040.14,
    "lower_formatted": "$3,268,824.08",
    "upper_formatted": "$5,448,040.14"
  },
  "similar_projects": {
    "count": 6147,
    "avg_cost_formatted": "$1,310,815.37",
    "median_cost_formatted": "$856,470.56",
    "match_type": "exact"
  },
  "timestamp": "2025-11-25 14:30:22"
}

GET /api/dataset_stats

Response:
{
  "total_projects": 17025,
  "avg_cost": 1142356.78,
  "median_cost": 856470.56,
  "min_cost": 10500.00,
  "max_cost": 15200000.00
}

GET /api/model_metrics

Response:
{
  "test_mape": 21.97,
  "r2_score": 0.9463,
  "mae": 271543.90,
  "rmse": 412583.00,
  "n_features": 13
}
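The two GET endpoints return plain JSON, so they can be read with the standard library alone. The sketch below is one possible client: `fetch_json` and `summarize_metrics` are hypothetical helpers, the base URL is the local development address used in this README, and the offline check runs against the documented sample response rather than a live server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # local development address from this README

def fetch_json(path, timeout=10):
    """GET BASE_URL + path and decode the JSON body."""
    with urllib.request.urlopen(BASE_URL + path, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

def summarize_metrics(metrics, target_mape=25.0):
    """Turn a /api/model_metrics payload into a one-line status string."""
    status = "meets" if metrics["test_mape"] < target_mape else "misses"
    return (
        f"MAPE {metrics['test_mape']:.2f}% {status} the {target_mape:.0f}% target; "
        f"R2 {metrics['r2_score']:.4f}"
    )

# Offline check against the documented sample response.
sample = json.loads(
    '{"test_mape": 21.97, "r2_score": 0.9463, "mae": 271543.90, '
    '"rmse": 412583.00, "n_features": 13}'
)
summary = summarize_metrics(sample)

# With the app running locally:
# print(summarize_metrics(fetch_json("/api/model_metrics")))
```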
Clean input form for entering construction project details.
Example of the form with sample data entered for cost prediction.
Comparison of different machine learning models' performance metrics.
Interactive dashboard showing key project analytics and insights.
- Krishna Aryal
- Kumar Sawan
- Neema Kafwimi
- Construction Cost Database LLC - Dataset provider and project client
- Georgia Tech CSE6748 - Course faculty and teaching assistants
- scikit-learn Community - Open-source ML library
- Flask Team - Web framework development
- Stack Overflow Community - Problem-solving support
This project is part of an academic practicum for Georgia Institute of Technology.
If you use this work, please cite:
@misc{construction_cost_estimator_2025,
  title={ML-Based Class 5 Construction Cost Estimator},
  author={Aryal, Krishna and Sawan, Kumar and Kafwimi, Neema},
  year={2025},
  institution={Georgia Institute of Technology},
  course={CSE6748 - Applied Analytics Practicum}
}

Class 5 Estimates Only
This model provides conceptual estimates with ±25% accuracy. Not suitable for detailed bidding, final estimates, or contractual commitments.

Historical Data Limitation
Model trained on 2010-2025 data. May not capture unprecedented market conditions, novel construction methods, or future trends.

Professional Validation Required
Always validate estimates with construction professionals and adjust for project-specific factors not captured by the model.
- Real-time cost index updates
- User authentication & project history
- Export to PDF/Excel
- Mobile responsive improvements
- Additional ML models (Neural Networks)
- Integration with external cost databases
- API rate limiting & authentication
- Multi-language support
- Email: karyal@gatech.edu
- Documentation: Read the docs
- Live Demo: [Coming Soon]
- Dataset: PCS Historical Project Database
- Course: CSE6748 - Applied Analytics Practicum

Built with ❤️ by the Georgia Tech Team
"Accurate Early-Stage Cost Estimation Powered by Machine Learning"
