🧬 Bioinformatics & Data Science Repository

A curated collection of data science projects, scripts, and analyses with a focus on bioinformatics.
This repository serves as a portfolio and resource for exploring computational biology, genomics, proteomics, and related fields through the lens of data science.

📁 Repository Structure

Data-Science/
│
├── datasets/               # Metadata or processed data (no large raw data)
├── notebooks/              # Jupyter notebooks for analysis and visualization
├── scripts/
│   ├── data_download/      # Automated data download & preprocessing scripts
│   ├── analysis/           # Python/R scripts for bioinformatics analyses
│   └── ml_models/          # Machine learning pipelines and model training
├── results/                # Output figures, reports, and tables
├── environment.yml         # Conda environment specification
├── requirements.txt        # Python dependencies (for pip users)
└── README.md               # This file

🚀 Quick Start

Prerequisites

Python 3.8+
conda (recommended) or pip

Installation

Clone the repository:

git clone https://github.com/prabathjayatissa/Data-Science.git
cd Data-Science

Set up the environment (using Conda - Recommended):

conda env create -f environment.yml
conda activate bioinfo_ds_env  # Replace bioinfo_ds_env with the name defined in environment.yml

Alternatively, using pip:

pip install -r requirements.txt
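
After installing, a quick sanity check of the environment can save debugging time later. The sketch below is only an illustration: the module list is an assumption based on the tools named in this README, not the actual contents of environment.yml or requirements.txt, so adjust it to match your setup.

```python
import importlib.util
import sys

# Assumed import names, taken from the tools this README lists;
# edit to match environment.yml / requirements.txt.
EXPECTED_MODULES = ["numpy", "pandas", "sklearn", "Bio", "matplotlib"]


def python_is_supported(version_info=sys.version_info, minimum=(3, 8)):
    """Return True if the interpreter meets the Python 3.8+ prerequisite."""
    return tuple(version_info[:2]) >= minimum


def missing_modules(names=EXPECTED_MODULES):
    """Return the top-level modules that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("Missing modules:", missing_modules() or "none")
```

Run it inside the activated environment; any names reported as missing point to a package that was not installed.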

📊 Datasets

This repository works with the following types of bioinformatics data.
Because of size constraints, large raw data files are not stored in Git; fetch them with the download scripts instead.

| Dataset | Description | Source | Size | Link |
| --- | --- | --- | --- | --- |
| TCGA-BRCA | RNA-Seq and clinical data for Breast Cancer | The Cancer Genome Atlas | ~2 GB | GDC Portal |
| Human Genome (GRCh38) | Reference genome sequence | GENCODE | ~1 GB | GENCODE |
| Example Microbiome | 16S rRNA sequencing data from a mock community | [Cite Source] | ~50 MB | |

Pro Tip:

Use tables for clarity.
Always provide a source and link (DOI if available).
Mention the size to set expectations.
For large datasets, provide scripts in scripts/data_download/ to automate fetching.

Data Download Scripts

Run the provided scripts from the repository root to fetch data:

# Example: Download and preprocess TCGA data
bash scripts/data_download/fetch_tcga_data.sh

# Example: Download a reference genome
python scripts/data_download/get_reference_genome.py
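
The contents of these scripts are not shown here, so the following is only a minimal sketch of what a download helper might look like, assuming a plain HTTP(S) download with an optional SHA-256 check. The function names `fetch` and `sha256_of` are hypothetical, and any URL or checksum you pass in must come from the data provider.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(url, dest, expected_sha256=None):
    """Download url to dest unless the file already exists, then verify it.

    Hypothetical helper: url and expected_sha256 are supplied by the caller.
    """
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
    if expected_sha256 is not None and sha256_of(dest) != expected_sha256:
        raise ValueError(f"Checksum mismatch for {dest}")
    return dest
```

Skipping the download when the file already exists keeps reruns cheap, and the checksum catches truncated or corrupted transfers before they reach an analysis.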

🧪 Projects & Analyses

Here are the key analyses and projects contained in this repository:

1. Differential Gene Expression Analysis

Description: Identifies genes that are significantly up- or down-regulated between two biological conditions (e.g., tumor vs. normal) using RNA-Seq data.
Tools: DESeq2 (R) / limma-voom / Scanpy (for scRNA-Seq)
Key Output: Lists of DEGs, volcano plots, MA plots.
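
To make the "lists of DEGs" step concrete, here is a small pure-Python sketch of how genes are typically labeled once a tool such as DESeq2 has produced log2 fold changes and raw p-values. It is not DESeq2 itself: it only applies a Benjamini-Hochberg correction and a threshold, and the cutoffs (|log2FC| >= 1, adjusted p < 0.05) are common defaults assumed for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def label_genes(genes, log2_fold_changes, pvalues, lfc_cutoff=1.0, alpha=0.05):
    """Label each gene 'up', 'down', or 'ns' (not significant).

    The cutoffs are assumed defaults, not values taken from this repository.
    """
    padj = benjamini_hochberg(pvalues)
    labels = {}
    for gene, lfc, q in zip(genes, log2_fold_changes, padj):
        if q < alpha and lfc >= lfc_cutoff:
            labels[gene] = "up"
        elif q < alpha and lfc <= -lfc_cutoff:
            labels[gene] = "down"
        else:
            labels[gene] = "ns"
    return labels
```

The same labels are what color the points in a volcano plot: "up" and "down" genes sit in the upper right and upper left corners, and "ns" genes fill the rest.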

2. Variant Calling Pipeline

Description: A Snakemake/Nextflow pipeline for calling genetic variants from NGS data (e.g., WES, WGS).
Tools: FastQC, BWA-MEM, GATK, Samtools, Snakemake
Key Output: VCF files containing SNP and Indel calls.
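
As a rough illustration of what the output contains, the sketch below counts SNP and indel calls in uncompressed VCF text using only the REF and ALT columns. It is a simplified reading of the format (symbolic alleles, `*`, and `.` are grouped as "OTHER", as are same-length multi-base substitutions), and the function names are hypothetical; a real pipeline would more likely rely on a library such as Pysam.

```python
def classify_variant(ref, alt):
    """Label one REF/ALT pair as 'SNP', 'INDEL', or 'OTHER'."""
    # Missing, spanning-deletion, and symbolic alleles are not classified here.
    if alt in {".", "*"} or alt.startswith("<"):
        return "OTHER"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "INDEL"
    return "OTHER"  # e.g. multi-nucleotide substitutions of equal length


def count_variant_types(vcf_lines):
    """Count variant types over an iterable of VCF text lines."""
    counts = {"SNP": 0, "INDEL": 0, "OTHER": 0}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]
        # Multi-allelic sites list several ALT alleles separated by commas.
        for alt in alts.split(","):
            counts[classify_variant(ref, alt)] += 1
    return counts
```

Passing an open file handle works as well as a list of strings, for example `count_variant_types(open("calls.vcf"))`, where `calls.vcf` is a placeholder file name.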

3. Machine Learning for Protein Function Prediction

Description: Predicts protein function from amino acid sequences using machine learning.
Tools: scikit-learn, Biopython, PyTorch
Features: Amino acid composition, PSSM profiles, protein embeddings (ESM, etc.)
Models: Random Forest, XGBoost, Simple Neural Network.
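
Of the features listed, amino acid composition is the simplest to compute. The sketch below shows one way to turn a sequence into a fixed-length 20-dimensional vector; the function name and the choice to ignore non-standard residues are assumptions for illustration, not necessarily what the scripts in this repository do.

```python
# The 20 standard amino acids, in a fixed order so feature columns line up.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence.

    Non-standard characters (e.g. X, B, Z, gaps) are ignored, which is an
    assumption of this sketch.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

Stacking these vectors row by row gives a feature matrix that can be passed to a model such as a Random Forest.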

🛠️ Tools & Technologies

🧰 Programming Languages

Python
R
Bash

🧬 Core Bioinformatics Libraries

Biopython, Pysam, PyVCF
DESeq2, limma, Bioconductor (R)
Scanpy, Scikit-bio

📈 Data Science & ML

pandas, numpy, scikit-learn, XGBoost
matplotlib, seaborn, plotly

⚙️ Workflow Management

Snakemake, Nextflow

🐳 Containers & Reproducibility

Conda, Docker

🤝 Contributing

Contributions are welcome!
If you have a new analysis, script, or suggestion for improvement, please:

Fork the repo
Create a feature branch
```
git checkout -b feature/AmazingFeature
```
Commit your changes
```
git commit -m "Add some AmazingFeature"
```
Push to your branch
```
git push origin feature/AmazingFeature
```
Open a Pull Request

Please ensure your code is well-documented and follows the existing style.

📜 License

The code in this repository is licensed under the MIT License; see the LICENSE file for details.

Note: Datasets may have their own licensing terms. Please check each source for compliance.

📚 References & Useful Links

Biostars Handbook — A fantastic resource for bioinformatics.
ROSALIND — A platform for learning bioinformatics through problem-solving.
GATK Best Practices — For variant discovery pipelines.
GEO Database — Functional genomics data repository.

⭐ If you find this repository useful, please consider giving it a star!
