"The goal is to turn data into information, and information into insight."
— Carly Fiorina
Master's in Data Science student at Northeastern University with 3.5+ years of data engineering experience. Passionate about Machine Learning, AI, and Data Science, with hands-on experience in data engineering, feature selection, and building scalable data pipelines for analytics and ML applications.
Global leader in tire manufacturing and sustainable mobility solutions, renowned for innovation in automotive technology and the Michelin Guide.
Built real-time data streaming systems using Kafka and Spring Boot for ML-ready data processing, specializing in feature engineering and data governance. Optimized data selection and cleaning processes, reducing unnecessary features by identifying business-critical data points. Designed data flows for Corporate Data Lake integration, enabling downstream analytics and ML model training.
Data-driven startup focused on social media analytics and digital marketing intelligence.
Developed regression and classification models in Python and R on social media data, creating real-time trend prediction dashboards with Streamlit. Performed sentiment analysis and feature engineering on social media text and engagement data to identify content performance drivers. Built interactive visualizations using Matplotlib and Plotly, enabling data-driven strategies that achieved a 100% increase in engagement and an 82% rise in reach.
- Designed and implemented production-grade ETL pipeline integrating 8 datasets from multi-source architecture (GitHub/HTTP, MySQL, MongoDB) using parameterized Azure Data Factory with dynamic ForEach loops and metadata-driven execution.
- Performed complex PySpark transformations in Databricks including 5+ joins, aggregations, and NoSQL enrichment, implementing Medallion Architecture (Bronze/Silver/Gold) with broadcast join optimization and date-partitioned Parquet storage.
- Configured OAuth2 Service Principal authentication, IAM role-based access control, and deployed Synapse Analytics external tables for business intelligence consumption.
- Tech Stack: Azure Data Factory, Databricks (PySpark), ADLS Gen2, Synapse Analytics, MySQL, MongoDB, Parquet, Python
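
The broadcast join and date-partitioned output mentioned above can be illustrated with a small plain-Python sketch. This is not the project's PySpark code: the function names, field names, and sample rows are all hypothetical, and it only shows the idea that the small table is held in memory once while the large table is scanned a single time.

```python
from collections import defaultdict


def broadcast_join(large_rows, small_rows, key):
    """Inner-join a large dataset against a small lookup table.

    The small side is loaded into a dict once (the "broadcast"), so the
    large side is scanned a single time with no sorting or shuffling.
    """
    lookup = {row[key]: row for row in small_rows}
    joined = []
    for row in large_rows:
        match = lookup.get(row[key])
        if match is not None:  # inner-join semantics: unmatched rows are dropped
            joined.append({**match, **row})
    return joined


def partition_by_date(rows, date_field):
    """Group rows under 'field=value' keys, mimicking date-partitioned folders."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"{date_field}={row[date_field]}"].append(row)
    return dict(partitions)


# Hypothetical sample data, not taken from the project.
orders = [
    {"order_id": 1, "customer_id": "c1", "order_date": "2024-01-01", "amount": 40.0},
    {"order_id": 2, "customer_id": "c2", "order_date": "2024-01-01", "amount": 15.5},
    {"order_id": 3, "customer_id": "c1", "order_date": "2024-01-02", "amount": 22.0},
    {"order_id": 4, "customer_id": "c9", "order_date": "2024-01-02", "amount": 9.0},
]
customers = [
    {"customer_id": "c1", "segment": "retail"},
    {"customer_id": "c2", "segment": "wholesale"},
]

gold = partition_by_date(
    broadcast_join(orders, customers, "customer_id"), "order_date"
)
```

In Spark the same intent is usually expressed by hinting that the smaller DataFrame should be broadcast; the dict here plays that role on a single machine.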
- Engineered a medical chatbot using Retrieval-Augmented Generation (RAG) architecture to process 16,407 medical Q&A pairs, integrating FAISS vector database with Sentence Transformers (all-MiniLM-L6-v2, 384-dim embeddings) for semantic search and GPT-3.5-turbo for natural language generation.
- Built scalable end-to-end pipeline achieving <2-second response latency, featuring modular Python codebase (435 lines) with optimized IndexFlatL2 similarity search and intelligent prompt engineering.
- Developed interactive Streamlit web interface with real-time conversational UI, session state management, comprehensive error handling, and secure API integration.
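
As a rough illustration of the retrieval step in a RAG pipeline like this one, the sketch below does an exhaustive L2 nearest-neighbour search in plain Python and assembles a prompt from the hits. It does not use FAISS or Sentence Transformers; the 3-dimensional vectors, placeholder Q&A entries, and function names are assumptions made for the example (the project itself uses 384-dimensional embeddings).

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(index, query_vector, k=2):
    """Return the k entries closest to the query, comparing against every vector."""
    ranked = sorted(index, key=lambda item: l2_distance(item["vector"], query_vector))
    return ranked[:k]


def build_prompt(question, retrieved):
    """Place the retrieved Q&A pairs in front of the user's question."""
    context = "\n".join(
        f"- Q: {item['question']} A: {item['answer']}" for item in retrieved
    )
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Toy index with made-up vectors and placeholder answers (hypothetical data).
index = [
    {"question": "What causes headaches?", "answer": "placeholder A", "vector": [0.9, 0.1, 0.0]},
    {"question": "How is a fever treated?", "answer": "placeholder B", "vector": [0.1, 0.9, 0.0]},
    {"question": "What is a migraine?", "answer": "placeholder C", "vector": [0.8, 0.2, 0.1]},
]

query_vector = [1.0, 0.0, 0.0]  # stands in for the embedded user question
hits = search(index, query_vector, k=2)
prompt = build_prompt("Why do I get headaches?", hits)
```

A flat (exhaustive) index trades speed for exactness, which is a reasonable choice at the scale of a few thousand Q&A pairs.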
NLP Classification & Risk Analysis | Fine-tuned Legal-BERT
Developed an intelligent legal assistant for automated contract clause classification and risk analysis using a fine-tuned Legal-BERT model. Achieved 97.87% accuracy in identifying clause types (Cap on Liability, Audit Rights, Insurance), significantly outperforming the baseline TF-IDF + SVM model (85.16% accuracy). Integrated Google's Gemini API for comprehensive risk analysis and mitigation suggestions, deployed via a Streamlit chat interface.
Real Estate Analytics | 13K+ Properties
Performed comprehensive data wrangling and feature engineering on 13,000+ real estate records, reducing location dimensionality by 85% and engineering key metrics like price per square foot. Achieved 80% accuracy using Random Forest, outperforming Linear Regression and XGBoost, and deployed the model via Flask web application for real-time price predictions.
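
The two feature-engineering steps named above can be sketched in plain Python. The price-per-square-foot metric follows directly from the description; grouping infrequent locations under a single label is only one common way to shrink a high-cardinality location column and is an assumption here, as are the field names, the threshold, and the sample records.

```python
from collections import Counter


def add_price_per_sqft(records):
    """Return copies of the records with a derived price_per_sqft field."""
    return [{**r, "price_per_sqft": r["price"] / r["sqft"]} for r in records]


def bucket_rare_locations(records, min_count=2, other_label="other"):
    """Replace locations seen fewer than min_count times with a shared label."""
    counts = Counter(r["location"] for r in records)
    return [
        {
            **r,
            "location": r["location"]
            if counts[r["location"]] >= min_count
            else other_label,
        }
        for r in records
    ]


# Hypothetical listings, not taken from the project's dataset.
listings = [
    {"location": "Downtown", "sqft": 1000, "price": 200000},
    {"location": "Downtown", "sqft": 1500, "price": 330000},
    {"location": "Riverside", "sqft": 800, "price": 120000},
    {"location": "Riverside", "sqft": 1200, "price": 180000},
    {"location": "Hilltop", "sqft": 2000, "price": 500000},
    {"location": "Old Mill", "sqft": 900, "price": 90000},
]

features = bucket_rare_locations(add_price_per_sqft(listings))
distinct_before = len({r["location"] for r in listings})
distinct_after = len({r["location"] for r in features})
```

Fewer distinct location values means fewer one-hot columns for the downstream model, which is the practical payoff of this kind of reduction.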
Real-time Traffic Classification | 2,976 Records
Built a machine learning-powered web application that classifies traffic conditions into four categories (Low, Normal, High, Heavy) using a Random Forest classifier. Analyzed vehicle counts and time-based features with StandardScaler normalization, achieving accurate real-time traffic classification. Deployed an interactive Streamlit dashboard for instant traffic condition predictions with intuitive vehicle count controls.
Advanced ML Classification | 100K+ Records
Preprocessed 100K+ loan records with comprehensive missing value handling and feature selection, followed by exploratory data analysis using Seaborn and Plotly. Trained and tuned XGBoost and Random Forest models, achieving 87% accuracy in predicting loan defaults through advanced feature engineering, encoding, and scaling techniques.
Advanced Analytics & Churn Prediction | RFM Analysis
Performed comprehensive customer segmentation using RFM analysis (Recency, Frequency, Monetary) on retail transaction data, creating 5 distinct customer segments ranging from Champion to At Risk. Engineered key features including TotalRevenue and time-based variables, conducted churn analysis with a 90-day threshold, and identified seasonal sales patterns. Delivered actionable marketing strategies through advanced data visualization and customer behavior analysis.
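
A minimal sketch of how an RFM table with a 90-day churn flag might be computed is shown below. It assumes TotalRevenue is quantity times unit price and that a customer counts as churned when their last purchase is more than 90 days before the reference date; the field names and transactions are hypothetical, and the segment rules themselves are not reproduced.

```python
from datetime import date


def rfm_table(transactions, as_of, churn_days=90):
    """Aggregate transactions into recency, frequency, and monetary values per customer."""
    table = {}
    for t in transactions:
        row = table.setdefault(
            t["customer"], {"last": t["date"], "frequency": 0, "monetary": 0.0}
        )
        row["last"] = max(row["last"], t["date"])
        row["frequency"] += 1
        # Assumed definition of TotalRevenue: quantity * unit price.
        row["monetary"] += t["quantity"] * t["unit_price"]
    for row in table.values():
        row["recency_days"] = (as_of - row["last"]).days
        row["churned"] = row["recency_days"] > churn_days
    return table


# Hypothetical transactions, not taken from the project's dataset.
transactions = [
    {"customer": "alice", "date": date(2024, 5, 20), "quantity": 2, "unit_price": 10.0},
    {"customer": "alice", "date": date(2024, 6, 10), "quantity": 1, "unit_price": 30.0},
    {"customer": "bob", "date": date(2024, 1, 15), "quantity": 3, "unit_price": 5.0},
]

rfm = rfm_table(transactions, as_of=date(2024, 6, 30))
```

From a table like this, segments are typically assigned by scoring each of the three columns and combining the scores.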
Feel free to explore my repositories for projects in AI, ML, Data Engineering and more! 💡 🚀