A compilation of notebooks I created for Data Science-related tasks such as tutorials, Generative AI, Exploratory Data Analysis, and Machine Learning. More notebooks will be added as I learn new things and find the time to write about them.
Visit my website or my Medium profile, where I include everything listed here and much more.
Below is a summary of them.
The code is located here.
- Use the pre-trained Hugging Face model juliensimon/reviews-sentiment-analysis, fine-tuned for sentiment classification.
- Perform inference on the test set with the base model, achieving an accuracy of 79%.
- Set Training Arguments: Configure hyperparameters like learning rate, batch size, and number of epochs.
- Train the Model: Fine-tune the pre-trained model using the Hugging Face Trainer API, as sketched below.
- Evaluation: Test the fine-tuned model, achieving an improved accuracy of 89%.
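A minimal sketch of the fine-tuning flow with the Trainer API. The dataset and hyperparameter values here are placeholders, not the notebook's exact configuration:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "juliensimon/reviews-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Placeholder dataset; the notebook works with its own review data.
dataset = load_dataset("imdb")
tokenized = dataset.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-5,              # illustrative hyperparameters
    per_device_train_batch_size=16,
    num_train_epochs=3,
)
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["test"])
trainer.train()
print(trainer.evaluate())            # reporting accuracy needs a compute_metrics fn
```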
The code is located here.
Build a Retrieval-Augmented Generation (RAG) system to combine retrieval and generation capabilities, enabling precise, context-aware AI solutions for applications like chatbots, search engines, and document summarization.
- Fetch real-world data (e.g., NVIDIA's latest 10-K filing from SEC).
- Split the text into overlapping chunks to ensure context and continuity are preserved.
- Use SentenceTransformers (all-mpnet-base-v2) to create 768-dimensional semantic embeddings for each chunk.
- Capture meaningful relationships between query and text data.
- Store embeddings in FAISS, a high-speed vector similarity search library.
- Enable fast and scalable similarity-based searches across large datasets.
- Implement a retrieval pipeline that queries the FAISS index to return the most contextually relevant chunks.
- Efficiently handle user queries with top-k nearest neighbor search.
- Use a Hugging Face transformer model (e.g., Qwen2.5-1.5B-Instruct) for generating grounded answers.
- Design prompts with context and user queries to ensure accurate and factual responses.
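A condensed sketch of this pipeline, assuming the filing text has already been chunked; the chunk contents and query are illustrative:

```python
import faiss
from sentence_transformers import SentenceTransformer

chunks = ["chunk one text ...", "chunk two text ..."]   # overlapping text chunks

embedder = SentenceTransformer("all-mpnet-base-v2")
embeddings = embedder.encode(chunks)                    # (n_chunks, 768) float32 array

index = faiss.IndexFlatL2(embeddings.shape[1])          # exact L2 similarity search
index.add(embeddings)

query = "What are the main risk factors?"
query_vec = embedder.encode([query])
_, ids = index.search(query_vec, 2)                     # top-k nearest chunks
context = "\n".join(chunks[i] for i in ids[0])

# The prompt grounds the generator (e.g., Qwen2.5-1.5B-Instruct) in retrieved context.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```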
The code is located here.
Learn how to efficiently retrieve and rank text to power state-of-the-art RAG systems, enabling context-aware AI applications like chatbots and document search.
- Use SentenceTransformers (all-mpnet-base-v2) to embed text chunks as 768-dimensional vectors.
- Leverage FAISS for efficient similarity search and fast retrieval.
- Implement overlapping chunks to maintain continuity and relevance.
- Ideal for narratives or technical documents requiring interdependent context.
- Rank documents using efficient keyword-based search for initial filtering.
- Combine with dense retrieval or reranking for improved precision.
- Use Hugging Face cross-encoders to rank query-document pairs by semantic relevance.
- Achieve context-aware retrieval with hybrid BM25 and reranking pipelines.
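A sketch of the hybrid pipeline: BM25 filters candidates, then a cross-encoder reranks them. The rank_bm25 package and the cross-encoder checkpoint are assumptions, not necessarily what the notebook uses:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

docs = ["faiss enables fast vector search", "bm25 ranks by keyword overlap"]
bm25 = BM25Okapi([d.split() for d in docs])         # keyword index

query = "keyword based ranking"
scores = bm25.get_scores(query.split())
top = sorted(range(len(docs)), key=lambda i: -scores[i])[:10]
candidates = [docs[i] for i in top]                 # initial keyword filtering

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, d) for d in candidates]
reranked = sorted(zip(candidates, reranker.predict(pairs)),
                  key=lambda x: -x[1])              # semantic reranking
print(reranked[0][0])
```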
The code is located here.
How to optimize, guide, and control language models for precise, efficient, and application-ready outputs.
This project showcases how advanced prompt engineering techniques can transform model outputs:
- Role-based instructions: Utilize multi-component prompts with clear roles (user, assistant, system) for context-rich conversations.
- Structured outputs: Create JSON responses using advanced prompts for tasks like creating Pokémon representations for real-world companies.
Controlling the randomness and structure of outputs is key in generative AI:
- Grammar enforcement: Generate and validate JSON outputs directly using grammar-constrained sampling.
- Applications: Ensure output reliability for tasks such as data generation, sentiment classification, and domain-specific profiles.
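A sketch of grammar-constrained JSON generation, assuming a local GGUF model run through llama-cpp-python; the model path and grammar are illustrative:

```python
from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="model.Q6_K.gguf", verbose=False)  # hypothetical local file

# GBNF grammar restricting output to a single valid JSON object
grammar = LlamaGrammar.from_string(r'''
root ::= "{" ws "\"sentiment\"" ws ":" ws ("\"positive\"" | "\"negative\"") ws "}"
ws   ::= [ \t\n]*
''')

out = llm("Classify the sentiment of: 'Great phone!'", grammar=grammar, max_tokens=32)
print(out["choices"][0]["text"])    # guaranteed to parse as JSON
```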
Efficiency meets performance through quantized models:
- Reduced memory footprint: Leverage LLaMA's quantized models (e.g., Q2, Q6, fp16) for efficient resource utilization.
- Scalability: Run large models on limited hardware without compromising accuracy.
- Custom configurations: Experiment with precision levels to balance speed, accuracy, and computational cost.
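A sketch comparing quantization levels with llama-cpp-python; the GGUF file names are placeholders for whichever precision levels (Q2, Q6, fp16) you download:

```python
from llama_cpp import Llama

# Hypothetical local files at different precision levels
for path in ["llama-q2_k.gguf", "llama-q6_k.gguf"]:
    llm = Llama(model_path=path, n_ctx=2048, verbose=False)
    out = llm("Define quantization in one sentence.", max_tokens=40)
    print(path, "->", out["choices"][0]["text"])
```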
Improve model comprehension and output relevance with in-context learning:
- Zero-shot, one-shot, and few-shot examples: Demonstrate tasks directly within prompts for enhanced accuracy.
- Real-world applications: Use contextual examples to generate creative and structured results for niche domains.
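A few-shot, role-based prompt sketch using the transformers chat pipeline; the model and examples are illustrative, not the notebook's exact setup:

```python
from transformers import pipeline

chat = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

messages = [
    {"role": "system", "content": "You classify review sentiment and reply in JSON."},
    # Few-shot examples demonstrate the task directly inside the prompt
    {"role": "user", "content": "Review: Loved it!"},
    {"role": "assistant", "content": '{"sentiment": "positive"}'},
    {"role": "user", "content": "Review: Broke after one day."},
]
result = chat(messages, max_new_tokens=20)
print(result[0]["generated_text"][-1])   # the model's new assistant turn
```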
The code is located here and the related article here.
In this project, I compare traditional sentiment analysis methods with cutting-edge generative AI models to classify product reviews. The full project details, including code and evaluation metrics, are available in the accompanying article.
- Dataset:
- Worked with the Flipkart Customer Review dataset to classify reviews as positive or negative.
- Excluded neutral reviews (3-star ratings) to focus on clear sentiment polarities.
- Methods Explored:
- Logistic Regression with TF-IDF: A lightweight and interpretable baseline approach.
- Logistic Regression with Pretrained Embeddings: Leveraging advanced models like all-MiniLM-L6-v2 for semantic feature extraction.
- Zero-Shot Classification: Using embeddings and cosine similarity to classify reviews without labeled data.
- Generative AI (Flan-T5): Fine-tuned generative models to generate sentiment labels based on prompts.
- Task-Specific Models: Employing fine-tuned models like juliensimon/reviews-sentiment-analysis for domain-specific performance.
- Performance Evaluation:
- Compared methods based on accuracy, F1-score, computational cost, and the need for labeled data.
- Highlighted trade-offs between traditional and modern approaches.
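For reference, the TF-IDF baseline looks roughly like this; the reviews below are toy stand-ins for the Flipkart data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

reviews = ["great product, loved it", "terrible quality, broke fast",
           "works perfectly", "waste of money"]      # toy reviews
labels = [1, 0, 1, 0]                                # 1 = positive, 0 = negative

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(reviews), labels)
print(clf.predict(vec.transform(["loved this product"])))
```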
The code is located here and the related article here.
- Use Wikipedia to grab movies, and more specifically their Summaries and Plots
- Merge IMDb data with Wikipedia
- Build, evaluate, and visualize an LDA model
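A minimal LDA sketch with gensim; the tokenized documents stand in for the merged Wikipedia/IMDb plots:

```python
from gensim import corpora, models

docs = [["alien", "spaceship", "planet", "crew"],
        ["love", "wedding", "family", "home"]]   # tokenized plot summaries

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words representation

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
print(lda.print_topics())
```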
The article is available on Towards Data Science and the code is located here.
- What is Outlier Detection?
- Causes
- Applications
- Approaches
- Taxonomy
- Algorithms - Isolation Forest, Extended Isolation Forest, Local Outlier Factor, DBSCAN, One-Class SVM, Ensemble
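As a taste of the algorithms covered, here is an Isolation Forest sketch on synthetic data (parameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (100, 2)),       # inliers
               rng.uniform(-6, 6, (5, 2))])      # injected outliers

iso = IsolationForest(contamination=0.05, random_state=42).fit(X)
labels = iso.predict(X)                          # -1 = outlier, 1 = inlier
print((labels == -1).sum(), "points flagged as outliers")
```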
This is a very popular Kaggle kernel with more than 1,250 upvotes and 80,000 views, which won 1st prize for the best kernel in that Kaggle competition.
Two articles on Towards Data Science (Part 1, Part 2). Code is available here.
- What is a Time Series?
- The Basic Steps in a Forecasting Task
- Time Series Graphics (Time Plot, Seasonal Plot, Seasonal Subseries Plot, Lag Scatter Plot)
- Time Series Components
- Stationarity
- Autocorrelation
- Moving Average, Double and Triple Exponential Smoothing
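A triple (Holt-Winters) exponential smoothing sketch with statsmodels; the monthly series here is synthetic:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

t = np.arange(48)
series = pd.Series(10 + 0.5 * t + 3 * np.sin(2 * np.pi * t / 12),   # trend + seasonality
                   index=pd.date_range("2020-01", periods=48, freq="MS"))

model = ExponentialSmoothing(series, trend="add", seasonal="add",
                             seasonal_periods=12).fit()
print(model.forecast(6))    # forecast six months ahead
```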
The task is to forecast, as precisely as possible, the unit sales (demand) of various products sold in the USA by Walmart. Models compared: Simple Exponential Smoothing, Double Exponential Smoothing, Triple Exponential Smoothing, ARIMA, SARIMA, SARIMAX, Light Gradient Boosting, Random Forest, Linear Regression.
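One of the compared models, sketched in isolation: a SARIMA fit via statsmodels SARIMAX on a synthetic monthly series (the orders are illustrative, not the tuned values from the notebook):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

t = np.arange(48)
series = pd.Series(10 + 0.5 * t + 3 * np.sin(2 * np.pi * t / 12),
                   index=pd.date_range("2020-01", periods=48, freq="MS"))

sarima = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
print(sarima.forecast(6))
```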
The article is available on Towards Data Science and the code is located here.
This project is meant to help practice some technologies and Data Science concepts.
Let's suppose that you live in Toronto, Canada (you can do this for every city that has enough data) and you found a better job. This job is located on the other side of the city, and you decide that you need to relocate closer. You really like your neighborhood, though, and you want to find a similar one.
This code uses the venues of each neighborhood as features in a clustering algorithm (k-means) and finds similar neighborhoods.
Technologies used:
- Beautiful Soup - Package that lets us extract the content of a web page into simple text
- JSON - Handle JSON files and transform them into a pandas DataFrame
- Geocode - Package that converts an address to its coordinates
- Scikit-Learn - Machine learning package, used here for clustering
- Folium - Package to create spatial maps. NOTE: Maps created with folium are not displayed in the Jupyter notebook; I provide links to them as static images.
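The clustering step, sketched with a toy venue-frequency matrix in place of the real neighborhood data:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Rows = neighborhoods, columns = share of each venue category (toy numbers)
venues = pd.DataFrame(
    {"cafe": [0.5, 0.1, 0.4], "park": [0.2, 0.6, 0.3], "gym": [0.3, 0.3, 0.3]},
    index=["Neighborhood A", "Neighborhood B", "Neighborhood C"],
)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(venues)
print(dict(zip(venues.index, km.labels_)))   # similar neighborhoods share a label
```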
Are you starting out in Data Science? Pandas is perhaps the first thing you will need, and it's really easy to learn!
After reading (and practicing) this tutorial you will learn how to:
- Create, add, remove and rename columns
- Read, select and filter data
- Retrieve statistics for data
- Sort and group data
- Manipulate data
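A small sample of those operations on toy data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [28, 34, 28]})

df["senior"] = df["age"] > 30                        # create a column
df = df.rename(columns={"name": "person"})           # rename a column
print(df[df["age"] >= 30])                           # select and filter rows
print(df["age"].describe())                          # summary statistics
print(df.sort_values("age").groupby("age").size())   # sort and group
```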
Normalization and standardization are designed to achieve a similar goal: creating features with comparable ranges. They are widely used in data analysis to help make sense of the raw data.
This notebook includes:
- Normalization
- Why normalize?
- Standardization
- Why standardization?
- Differences?
- When to use and when not
- Python code for Simple Feature Scaling, Min-Max, Z-score, log1p transformation
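The four transformations, sketched on a toy series:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 5.0, 10.0, 100.0])

simple = x / x.max()                           # simple feature scaling
minmax = (x - x.min()) / (x.max() - x.min())   # min-max scaling to [0, 1]
zscore = (x - x.mean()) / x.std()              # standardization (z-score)
logged = np.log1p(x)                           # log1p for skewed distributions
print(pd.DataFrame({"simple": simple, "minmax": minmax,
                    "zscore": zscore, "log1p": logged}))
```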
Python code showing how to transform nominal and ordinal variables into integers.
This Notebook includes:
- Ordinal Encoding with LabelEncoder, Pandas' factorize, and Pandas' map
- Nominal Encoding with One-Hot Encoding and Binary Encoding
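Both encoding styles, sketched on toy columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"size": ["S", "M", "L", "M"],                 # ordinal
                   "city": ["Paris", "Rome", "Paris", "Oslo"]})  # nominal

df["size_le"] = LabelEncoder().fit_transform(df["size"])   # LabelEncoder
df["size_fact"] = pd.factorize(df["size"])[0]              # Pandas factorize
df["size_map"] = df["size"].map({"S": 0, "M": 1, "L": 2})  # Pandas map
one_hot = pd.get_dummies(df["city"], prefix="city")        # one-hot encoding
print(df.join(one_hot))
```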
Every plot that seaborn provides is here, with examples on a real dataset.
This notebook includes:
- Theory on Skewness and Kurtosis
- Univariate plots. [Histogram, KDE, Box plot, Count plot, Pie chart]
- Bivariate plots. [Scatter plot, Joint plot, Reg plot, KDE plot, Hex plot, Line plot, Bar plot, Violin plot, Boxen plot, Strip plot]
- Multivariate plots. [Correlation Heatmap, Pair plot, Scatter plot, Line plot, Bar plot]
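A few of these, shown on one of seaborn's built-in datasets:

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

sns.histplot(tips["total_bill"])                        # univariate: histogram
plt.figure()
sns.boxplot(x="day", y="total_bill", data=tips)         # bivariate: box plot
plt.figure()
sns.heatmap(tips.corr(numeric_only=True), annot=True)   # multivariate: heatmap
plt.show()
```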
In this tutorial I present the datetime format that Pandas provides for handling datetime features. At the end, I create a function that generates 23 features from a single one.
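A sketch of the idea via the pandas .dt accessor (the notebook's function extends this to 23 features):

```python
import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(["2021-03-14 09:30", "2021-12-25 18:00"])})

df["year"] = df["ts"].dt.year
df["month"] = df["ts"].dt.month
df["dayofweek"] = df["ts"].dt.dayofweek        # Monday = 0
df["hour"] = df["ts"].dt.hour
df["is_weekend"] = df["ts"].dt.dayofweek >= 5
print(df)
```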