Raunaksingh Khalsa rkpm22

Hi there, I'm Raunaksingh Khalsa 👋

"The goal is to turn data into information, and information into insight."
— Carly Fiorina

About Me

Master's student in Data Science at Northeastern University with 3.5+ years of data engineering experience. Passionate about machine learning, AI, and data science, with hands-on experience in data engineering, feature selection, and building scalable data pipelines for analytics and ML applications.

Here's a bit about what I've worked on:

Michelin - Integration and Data Engineer

Global leader in tire manufacturing and sustainable mobility solutions, renowned for innovation in automotive technology and the Michelin Guide.

Built real-time data streaming systems using Kafka and Spring Boot for ML-ready data processing, specializing in feature engineering and data governance. Optimized data selection and cleaning processes, reducing unnecessary features by identifying business-critical data points. Designed data flows for Corporate Data Lake integration, enabling downstream analytics and ML model training.

Medialytics - Data Science Intern

Data-driven startup focused on social media analytics and digital marketing intelligence.

Developed regression and classification models in Python and R on social media data, creating real-time trend prediction dashboards with Streamlit. Performed sentiment analysis and feature engineering on social media text and engagement data to identify content performance drivers. Built interactive visualizations using Matplotlib and Plotly, enabling data-driven strategies that achieved a 100% increase in engagement and an 82% rise in reach.

Key Projects

Azure End-to-End Data Engineering Pipeline | E-commerce Analytics

Designed and implemented a production-grade ETL pipeline integrating 8 datasets from a multi-source architecture (GitHub/HTTP, MySQL, MongoDB) using parameterized Azure Data Factory pipelines with dynamic ForEach loops and metadata-driven execution.
Performed complex PySpark transformations in Databricks including 5+ joins, aggregations, and NoSQL enrichment, implementing Medallion Architecture (Bronze/Silver/Gold) with broadcast join optimization and date-partitioned Parquet storage.
Configured OAuth2 Service Principal authentication, IAM role-based access control, and deployed Synapse Analytics external tables for business intelligence consumption.
Tech Stack: Azure Data Factory, Databricks (PySpark), ADLS Gen2, Synapse Analytics, MySQL, MongoDB, Parquet, Python

MediQueryAI - AI-Powered Medical Chatbot

Engineered a medical chatbot using a Retrieval-Augmented Generation (RAG) architecture to process 16,407 medical Q&A pairs, integrating a FAISS vector index with Sentence Transformers (all-MiniLM-L6-v2, 384-dim embeddings) for semantic search and GPT-3.5-turbo for natural language generation.
Built a scalable end-to-end pipeline achieving <2-second response latency, featuring a modular Python codebase (435 lines) with IndexFlatL2 similarity search and prompt engineering.
Developed an interactive Streamlit web interface with a real-time conversational UI, session state management, error handling, and secure API integration.
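
For readers unfamiliar with flat L2 indexes, the retrieval step can be pictured as an exhaustive nearest-neighbour scan. The sketch below is a pure-Python stand-in for that idea, not the project's code and not the FAISS API: the function names and the toy 3-dimensional vectors are hypothetical, and the real system stores 384-dimensional sentence embeddings.

```python
from typing import List, Tuple


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(
    index: List[List[float]], query: List[float], k: int
) -> List[Tuple[int, float]]:
    """Compare the query against every stored vector and return the k closest.

    Each result is (position in index, squared L2 distance), nearest first.
    """
    scored = [(i, squared_l2(vec, query)) for i, vec in enumerate(index)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical toy "embeddings" standing in for encoded Q&A pairs.
toy_index = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [3.0, 3.0, 3.0],
]

# The positions returned here would be used to look up the matching
# Q&A text that is then placed into the generation prompt.
nearest = flat_l2_search(toy_index, [0.9, 0.1, 0.0], k=2)
```

An exhaustive scan is exact and simple, which is a reasonable fit for a corpus of this size; approximate indexes tend to matter only at much larger scales.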

⚖️ LegalClause: Legal Contract Analysis

NLP Classification & Risk Analysis | Fine-tuned Legal-BERT

Developed an intelligent legal assistant for automated contract clause classification and risk analysis using a fine-tuned Legal-BERT model. Achieved 97.87% accuracy in identifying clause types (Cap on Liability, Audit Rights, Insurance), significantly outperforming a baseline TF-IDF + SVM model (85.16% accuracy). Integrated Google's Gemini API for risk analysis and mitigation suggestions, deployed via a Streamlit chat interface.

🏠 Bangalore House Price Prediction

Real Estate Analytics | 13K+ Properties

Performed data wrangling and feature engineering on 13,000+ real estate records, reducing location dimensionality by 85% and engineering key metrics such as price per square foot. Achieved 80% accuracy using Random Forest, outperforming Linear Regression and XGBoost, and deployed the model via a Flask web application for real-time price predictions.
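
One common way to shrink a high-cardinality location column is to fold rarely seen locations into a single bucket before encoding. The sketch below illustrates that idea together with a price-per-square-foot feature. It is an assumption about the general approach, not the project's actual code: the field names, the `min_count` threshold, and the sample listings are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def collapse_rare_locations(records: List[Dict], min_count: int) -> List[Dict]:
    """Relabel locations seen fewer than `min_count` times as 'other'."""
    counts = Counter(r["location"] for r in records)
    return [
        {**r, "location": r["location"] if counts[r["location"]] >= min_count else "other"}
        for r in records
    ]


def add_price_per_sqft(records: List[Dict]) -> List[Dict]:
    """Attach price / total_sqft to each record as 'price_per_sqft'."""
    return [{**r, "price_per_sqft": r["price"] / r["total_sqft"]} for r in records]


# Made-up listings for illustration only.
listings = [
    {"location": "Area A", "total_sqft": 1200.0, "price": 6_000_000.0},
    {"location": "Area A", "total_sqft": 1000.0, "price": 5_500_000.0},
    {"location": "Area B", "total_sqft": 1500.0, "price": 9_000_000.0},
    {"location": "Area C", "total_sqft": 800.0, "price": 3_200_000.0},
]

cleaned = add_price_per_sqft(collapse_rare_locations(listings, min_count=2))
distinct_locations = {r["location"] for r in cleaned}
```

With the rare locations merged, a one-hot encoding of `location` needs far fewer columns, which is the kind of reduction the summary above describes.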

🚦 Traffic Flow Prediction System

Real-time Traffic Classification | 2,976 Records

Built a machine learning-powered web application that classifies traffic conditions into four categories (Low, Normal, High, Heavy) using a Random Forest classifier. Analyzed vehicle counts and time-based features with StandardScaler normalization for real-time traffic classification. Deployed an interactive Streamlit dashboard for instant traffic condition predictions with intuitive vehicle count controls.
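
As a reminder of what StandardScaler-style normalization does to a feature column, here is a minimal pure-Python sketch: each value is shifted by the column mean and divided by the population standard deviation. The function name and the sample vehicle counts are hypothetical, and the project itself uses scikit-learn's `StandardScaler` rather than this helper.

```python
from math import sqrt
from typing import List


def standardize(values: List[float]) -> List[float]:
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)  # divide by n, not n - 1
    std = sqrt(variance) or 1.0  # constant columns are left unscaled
    return [(v - mean) / std for v in values]


# Made-up vehicle counts for one feature column.
car_counts = [10.0, 20.0, 30.0, 40.0]
scaled = standardize(car_counts)
```

Tree-based models such as Random Forest are largely insensitive to feature scale, so the scaling step mainly keeps the preprocessing consistent if other model types are compared on the same features.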

🏦 Loan Default Prediction

Advanced ML Classification | 100K+ Records

Preprocessed 100K+ loan records with comprehensive missing value handling and feature selection, followed by exploratory data analysis using Seaborn and Plotly. Trained and tuned XGBoost and Random Forest models, achieving 87% accuracy in predicting loan defaults through advanced feature engineering, encoding, and scaling techniques.

🛒 Retail Sales Analytics & Customer Segmentation

Advanced Analytics & Churn Prediction | RFM Analysis

Performed customer segmentation using RFM analysis (Recency, Frequency, Monetary) on retail transaction data, creating 5 distinct customer segments from Champion to At Risk. Engineered key features including TotalRevenue and time-based variables, conducted churn analysis with a 90-day threshold, and identified seasonal sales patterns. Delivered actionable marketing strategies through data visualization and customer behavior analysis.
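
The RFM values and the 90-day churn flag can be derived from raw transactions roughly as follows. This is a sketch under stated assumptions rather than the project's implementation: the tuple layout, the function name, the sample transactions, and the choice to flag churn when recency is strictly greater than 90 days are all hypothetical.

```python
from datetime import date
from typing import Dict, List, Tuple

CHURN_THRESHOLD_DAYS = 90  # threshold from the project summary; strict ">" is an assumption


def rfm_table(
    transactions: List[Tuple[str, date, float]], as_of: date
) -> Dict[str, dict]:
    """Aggregate (customer, purchase date, amount) rows into per-customer RFM values."""
    table: Dict[str, dict] = {}
    for customer, day, amount in transactions:
        row = table.setdefault(customer, {"last": day, "frequency": 0, "monetary": 0.0})
        row["last"] = max(row["last"], day)  # most recent purchase date
        row["frequency"] += 1                # number of purchases
        row["monetary"] += amount            # total spend
    return {
        customer: {
            "recency_days": (as_of - row["last"]).days,
            "frequency": row["frequency"],
            "monetary": row["monetary"],
            "churned": (as_of - row["last"]).days > CHURN_THRESHOLD_DAYS,
        }
        for customer, row in table.items()
    }


# Made-up transactions for illustration only.
sample = [
    ("c1", date(2024, 1, 5), 120.0),
    ("c1", date(2024, 3, 20), 80.0),
    ("c2", date(2023, 11, 1), 300.0),
]
rfm = rfm_table(sample, as_of=date(2024, 4, 1))
```

Segment labels such as Champion or At Risk would then be assigned by scoring or binning these three values; the binning rules are not stated in the summary, so they are left out here.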

🛠️ Technical Skills

Programming Languages

Machine Learning & AI

Data Processing & Analytics

Data Visualization

Cloud & DevOps

Frameworks & Tools

📫 Let's Connect

Feel free to explore my repositories for projects in AI, ML, Data Engineering and more! 💡 🚀
