rkpm22/README.md

Hi there, I'm Raunaksingh Khalsa 👋


"The goal is to turn data into information, and information into insight."
Carly Fiorina

About Me

Master's student in Data Science at Northeastern University with 3.5+ years of data engineering experience. Passionate about Machine Learning, AI, and Data Science, with hands-on experience in feature selection and in building scalable data pipelines for analytics and ML applications.

Here's a bit about what I've worked on:

Michelin - Integration and Data Engineer

Global leader in tire manufacturing and sustainable mobility solutions, renowned for innovation in automotive technology and the Michelin Guide.

Built real-time data streaming systems using Kafka and Spring Boot for ML-ready data processing, specializing in feature engineering and data governance. Optimized data selection and cleaning processes, reducing unnecessary features by identifying business-critical data points. Designed data flows for Corporate Data Lake integration, enabling downstream analytics and ML model training.

Medialytics - Data Science Intern

Data-driven startup focused on social media analytics and digital marketing intelligence.

Developed regression and classification models in Python and R on social media data, and built real-time trend prediction dashboards with Streamlit. Performed sentiment analysis and feature engineering on social media text and engagement data to identify content performance drivers. Built interactive visualizations using Matplotlib and Plotly, enabling data-driven strategies that achieved a 100% increase in engagement and an 82% rise in reach.

Key Projects

Azure End-to-End Data Engineering Pipeline | E-commerce Analytics

  • Designed and implemented a production-grade ETL pipeline integrating 8 datasets from multiple sources (GitHub/HTTP, MySQL, MongoDB) using parameterized Azure Data Factory pipelines with dynamic ForEach loops and metadata-driven execution.
  • Performed complex PySpark transformations in Databricks, including 5+ joins, aggregations, and NoSQL enrichment, implementing a Medallion Architecture (Bronze/Silver/Gold) with broadcast join optimization and date-partitioned Parquet storage.
  • Configured OAuth2 Service Principal authentication and IAM role-based access control, and deployed Synapse Analytics external tables for business intelligence consumption.
  • Tech Stack: Azure Data Factory, Databricks (PySpark), ADLS Gen2, Synapse Analytics, MySQL, MongoDB, Parquet, Python
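
The broadcast join optimization mentioned above can be illustrated with a minimal pure-Python sketch. This is not the project's PySpark code; it only shows the idea behind a broadcast (map-side) hash join: the small table is turned into an in-memory lookup that every worker can hold, so the large table is streamed once without a shuffle. In PySpark the same intent is normally expressed with the `pyspark.sql.functions.broadcast` hint. The function name, column names, and sample rows below are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def broadcast_hash_join(large_rows: List[Row], small_rows: List[Row], key: str) -> List[Row]:
    """Inner-join large_rows to small_rows on `key` by hashing the small side."""
    # Build the lookup once from the small table (the "broadcast" side).
    lookup: Dict[Any, List[Row]] = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Stream over the large table and probe the lookup; unmatched rows are dropped.
    joined: List[Row] = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Made-up sample rows, for illustration only.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 120.0},
    {"order_id": 2, "customer_id": "c2", "amount": 35.5},
    {"order_id": 3, "customer_id": "c9", "amount": 10.0},  # no matching customer
]
customers = [
    {"customer_id": "c1", "state": "SP"},
    {"customer_id": "c2", "state": "RJ"},
]

enriched = broadcast_hash_join(orders, customers, "customer_id")
```

The approach only pays off when one side is small enough to fit in memory on every worker; otherwise a regular shuffle join is the safer choice.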

MediQueryAI - AI-Powered Medical Chatbot

  • Engineered a medical chatbot using a Retrieval-Augmented Generation (RAG) architecture over 16,407 medical Q&A pairs, integrating a FAISS vector index with Sentence Transformers (all-MiniLM-L6-v2, 384-dim embeddings) for semantic search and GPT-3.5-turbo for natural language generation.
  • Built a scalable end-to-end pipeline with <2-second response latency, featuring a modular Python codebase (435 lines), IndexFlatL2 similarity search, and prompt engineering.
  • Developed an interactive Streamlit web interface with a real-time conversational UI, session state management, comprehensive error handling, and secure API integration.
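
The retrieval step described above can be sketched without any external libraries. This is a hedged illustration, not the MediQueryAI code: a flat L2 index such as FAISS `IndexFlatL2` performs an exact brute-force search over all stored vectors, which the hypothetical `flat_l2_search` below mimics on toy 3-dimensional vectors (the real embeddings are 384-dimensional). The prompt template and sample passages are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(index_vectors: List[Sequence[float]],
                   query: Sequence[float], k: int) -> List[Tuple[float, int]]:
    """Exact nearest-neighbour search: score every vector, keep the k closest."""
    scored = [(squared_l2(vec, query), i) for i, vec in enumerate(index_vectors)]
    scored.sort()  # smallest distance first
    return scored[:k]


def build_prompt(question: str, passages: List[str]) -> str:
    """Assemble a grounded prompt from retrieved passages (hypothetical template)."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy stand-ins for embedded Q&A passages.
passages = ["Passage about sleep.", "Passage about headaches.", "Passage about migraines."]
vectors = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
query_vector = [1.0, 0.0, 0.0]  # would come from the same embedding model

hits = flat_l2_search(vectors, query_vector, k=2)
top_passages = [passages[i] for _, i in hits]
prompt = build_prompt("What treats a headache?", top_passages)
```

The resulting `prompt` string is what would be sent to the generation model; the API call itself is left out of the sketch.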

⚖️ LegalClause: Legal Contract Analysis

NLP Classification & Risk Analysis | Fine-tuned Legal-BERT

Developed an intelligent legal assistant for automated contract clause classification and risk analysis using a fine-tuned Legal-BERT model. Achieved 97.87% accuracy in identifying clause types (Cap on Liability, Audit Rights, Insurance), significantly outperforming a baseline TF-IDF + SVM model (85.16% accuracy). Integrated Google's Gemini API for risk analysis and mitigation suggestions, and deployed the assistant via a Streamlit chat interface.


🏠 Bangalore House Price Prediction

Real Estate Analytics | 13K+ Properties

Performed comprehensive data wrangling and feature engineering on 13,000+ real estate records, reducing the number of distinct location categories by 85% and engineering key metrics like price per square foot. Achieved 80% accuracy using Random Forest, outperforming Linear Regression and XGBoost, and deployed the model via a Flask web application for real-time price predictions.
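
The two feature-engineering steps named above can be sketched as follows. This is an assumption about how such a reduction is commonly done (grouping rarely seen locations into a single bucket), not a copy of the notebook; the `min_count` threshold, the `"other"` label, and the sample values are hypothetical.

```python
from collections import Counter
from typing import List


def collapse_rare_locations(locations: List[str], min_count: int) -> List[str]:
    """Replace locations seen fewer than min_count times with 'other'."""
    cleaned = [loc.strip() for loc in locations]  # normalise stray whitespace first
    counts = Counter(cleaned)
    return [loc if counts[loc] >= min_count else "other" for loc in cleaned]


def price_per_sqft(price: float, total_sqft: float) -> float:
    """Derived metric often used to spot outliers before modelling."""
    return price / total_sqft


# Made-up sample values, for illustration only.
raw_locations = ["Whitefield", "Whitefield ", "Hebbal", "Yelahanka", "Whitefield"]
reduced = collapse_rare_locations(raw_locations, min_count=2)
ppsf = price_per_sqft(6_000_000, 1200)
```

Collapsing the long tail of locations keeps one-hot encoding manageable, at the cost of losing detail for rarely listed areas.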


🚦 Traffic Flow Prediction System

Real-time Traffic Classification | 2,976 Records

Built a machine-learning web application that classifies traffic conditions into four categories (Low, Normal, High, Heavy) using a Random Forest classifier trained on vehicle counts and time-based features normalized with StandardScaler. Deployed an interactive Streamlit dashboard for instant traffic condition predictions with intuitive vehicle count controls.
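
The normalization step can be shown with a small standard-library sketch. scikit-learn's `StandardScaler` subtracts the training mean and divides by the population standard deviation; the hypothetical `fit_scaler` and `transform` helpers below reproduce that arithmetic for a single column, reusing the training statistics at prediction time. The vehicle counts are made up.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def fit_scaler(column: Sequence[float]) -> Tuple[float, float]:
    """Learn mean and population standard deviation from training data."""
    mean = fmean(column)
    std = pstdev(column)
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    return mean, (std if std > 0 else 1.0)


def transform(column: Sequence[float], mean: float, std: float) -> List[float]:
    """Apply z-score scaling with previously fitted statistics."""
    return [(x - mean) / std for x in column]


# Made-up car counts, for illustration only.
train_car_counts = [10, 20, 30]
mean, std = fit_scaler(train_car_counts)
scaled_train = transform(train_car_counts, mean, std)

# At prediction time, reuse the fitted statistics rather than refitting.
scaled_new = transform([20], mean, std)
```

Tree-based models such as Random Forest do not strictly need scaled inputs, so the scaling mainly keeps the preprocessing consistent between training and the dashboard.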


🏦 Loan Default Prediction

Advanced ML Classification | 100K+ Records

Preprocessed 100K+ loan records with comprehensive missing value handling and feature selection, followed by exploratory data analysis using Seaborn and Plotly. Trained and tuned XGBoost and Random Forest models, achieving 87% accuracy in predicting loan defaults through advanced feature engineering, encoding, and scaling techniques.


🛒 Retail Sales Analytics & Customer Segmentation

Advanced Analytics & Churn Prediction | RFM Analysis

Performed customer segmentation using RFM analysis (Recency, Frequency, Monetary) on retail transaction data, creating 5 distinct customer segments ranging from Champion to At Risk. Engineered key features including TotalRevenue and time-based variables, conducted churn analysis with a 90-day threshold, and identified seasonal sales patterns. Delivered actionable marketing strategies through data visualization and customer behavior analysis.
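
A minimal sketch of the RFM computation and the 90-day churn flag, under stated assumptions: transactions are `(customer_id, purchase_date, revenue)` tuples, recency is measured against a chosen snapshot date, and a customer counts as churned when recency exceeds 90 days (whether the project uses "greater than" or "at least" is not stated). The function name and sample data are hypothetical, and the scoring that maps RFM values to the five segments is not reproduced here.

```python
from datetime import date
from typing import Any, Dict, List, Tuple

CHURN_THRESHOLD_DAYS = 90  # threshold from the project description


def rfm_table(transactions: List[Tuple[str, date, float]],
              snapshot: date) -> Dict[str, Dict[str, Any]]:
    """Aggregate recency, frequency, and monetary value per customer."""
    table: Dict[str, Dict[str, Any]] = {}
    for customer_id, purchase_date, revenue in transactions:
        row = table.setdefault(
            customer_id,
            {"last_purchase": purchase_date, "frequency": 0, "monetary": 0.0},
        )
        row["last_purchase"] = max(row["last_purchase"], purchase_date)
        row["frequency"] += 1
        row["monetary"] += revenue

    for row in table.values():
        row["recency_days"] = (snapshot - row["last_purchase"]).days
        row["churned"] = row["recency_days"] > CHURN_THRESHOLD_DAYS
    return table


# Made-up transactions, for illustration only.
transactions = [
    ("c1", date(2024, 12, 1), 50.0),
    ("c1", date(2024, 12, 20), 25.0),
    ("c2", date(2024, 6, 15), 200.0),
]
rfm = rfm_table(transactions, snapshot=date(2024, 12, 31))
```

Here `rfm["c1"]` describes a recent repeat buyer, while `rfm["c2"]` is flagged as churned because the last purchase is well past the threshold.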

🛠️ Technical Skills

Programming Languages

Python, R, Java, SQL, JavaScript

Machine Learning & AI

Scikit-learn, PyTorch, TensorFlow, Keras, XGBoost, Transformers, OpenAI, LangChain

Data Processing & Analytics

Pandas, NumPy, Apache Kafka, Apache Spark, Databricks

Data Visualization

Matplotlib, Plotly, Seaborn, Power BI, Tableau

Cloud & DevOps

AWS, Google Cloud, Azure, Docker, Kubernetes, Git, Oracle

Frameworks & Tools

Streamlit, Flask, FastAPI, Spring Boot, Jupyter, HuggingFace, Gemini API, MLflow

📫 Let's Connect

Email LinkedIn


Feel free to explore my repositories for projects in AI, ML, Data Engineering and more! 💡 🚀

Pinned

  1. BrazilianEcommerceOlist (Public)

    A comprehensive data engineering project implementing a production-grade data pipeline using Azure services, processing Brazilian e-commerce data through a Medallion architecture.

    Jupyter Notebook

  2. MediQueryAI (Public)

    A RAG-powered medical FAQ chatbot that provides accurate, context-aware answers to medical questions using semantic search and AI generation.

    Python

  3. LegalClause (Public)

    LegalClause is an intelligent legal assistant that transforms manual contract review through automated clause classification and risk analysis.

    Jupyter Notebook

  4. predict-traffic-flow (Public)

    Intelligent traffic analysis system that uses machine learning to predict traffic conditions in real time, built with Python and Streamlit.

    Jupyter Notebook

  5. bengaluru_house_price_prediction (Public)

    Housing Price Prediction using ML (XGBoost, RandomForest, LinearRegression, DecisionTrees)

    Jupyter Notebook

  6. retail-sales-data-analysis (Public)

    Retail Sales Analytics & Customer Segmentation

    Jupyter Notebook