Skip to content

prabathjayatissa/Data-Science

Repository files navigation

🧬 Bioinformatics & Data Science Repository

A curated collection of data science projects, scripts, and analyses with a focus on bioinformatics.
This repository serves as a portfolio and resource for exploring computational biology, genomics, proteomics, and related fields through the lens of data science.


📁 Repository Structure

Data-Science/
│
├── datasets/               # Metadata or processed data (no large raw data)
├── notebooks/              # Jupyter notebooks for analysis and visualization
├── scripts/
│   ├── data_download/      # Automated data download & preprocessing scripts
│   ├── analysis/           # Python/R scripts for bioinformatics analyses
│   └── ml_models/          # Machine learning pipelines and model training
├── results/                # Output figures, reports, and tables
├── environment.yml         # Conda environment specification
├── requirements.txt        # Python dependencies (for pip users)
└── README.md               # This file

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • conda (recommended) or pip

Installation

  1. Clone the repository:

    git clone https://github.com/prabathjayatissa/Data-Science.git
    cd Data-Science
  2. Set up the environment (using Conda - Recommended):

    conda env create -f environment.yml
    conda activate bioinfo_ds_env  # Or whatever you name your environment in the YAML file

    Alternatively, using pip:

    pip install -r requirements.txt

📊 Datasets

This repository works with the following types of bioinformatics data.
Due to size constraints, raw data is often not stored in Git.

Dataset Description Source Size Link
TCGA-BRCA RNA-Seq and clinical data for Breast Cancer The Cancer Genome Atlas ~2 GB GDC Portal
Human Genome (GRCh38) Reference genome sequence GENCODE ~1 GB GENCODE
Example Microbiome 16S rRNA sequencing data from a mock community [Cite Source] ~50 MB DOI

Pro Tip:

  • Use tables for clarity.
  • Always provide a source and link (DOI if available).
  • Mention the size to set expectations.
  • For large datasets, provide scripts in scripts/data_download/ to automate fetching.

Data Download Scripts

Run the provided scripts from the repository root to fetch data:

# Example: Download and preprocess TCGA data
bash scripts/data_download/fetch_tcga_data.sh

# Example: Download a reference genome
python scripts/data_download/get_reference_genome.py

🧪 Projects & Analyses

Here are the key analyses and projects contained in this repository:

1. Differential Gene Expression Analysis

Description: Identifies significantly up/down-regulated genes between two biological conditions (e.g., tumor vs. normal) using RNA-Seq data.
Tools: DESeq2 (R) / limma-voom / Scanpy (for scRNA-Seq)
Key Output: Lists of DEGs, volcano plots, MA plots.


2. Variant Calling Pipeline

Description: A Snakemake/Nextflow pipeline for calling genetic variants from NGS data (e.g., WES, WGS).
Tools: FastQC, BWA-MEM, GATK, Samtools, Snakemake
Key Output: VCF files containing SNP and Indel calls.


3. Machine Learning for Protein Function Prediction

Description: Predicts protein function from amino acid sequences using machine learning.
Tools: scikit-learn, Biopython, PyTorch
Features: Amino acid composition, PSSM profiles, protein embeddings (ESM, etc.)
Models: Random Forest, XGBoost, Simple Neural Network.



🛠️ Tools & Technologies

🧰 Programming Languages

  • Python
  • R
  • Bash

🧬 Core Bioinformatics Libraries

  • Biopython, Pysam, PyVCF
  • DESeq2, limma, Bioconductor (R)
  • Scanpy, Scikit-bio

📈 Data Science & ML

  • pandas, numpy, scikit-learn, XGBoost
  • matplotlib, seaborn, plotly

⚙️ Workflow Management

  • Snakemake, Nextflow

🐳 Containers & Reproducibility

  • Conda, Docker

🤝 Contributing

Contributions are welcome!
If you have a new analysis, script, or suggestion for improvement, please:

  1. Fork the repo
  2. Create a feature branch
    git checkout -b feature/AmazingFeature
  3. Commit your changes
    git commit -m "Add some AmazingFeature"
  4. Push to your branch
    git push origin feature/AmazingFeature
  5. Open a Pull Request

Please ensure your code is well-documented and follows the existing style.


📜 License

This project is licensed under the MIT License — see the LICENSE file for details.

Note: The code in this repository is licensed under MIT.
However, datasets may have their own licensing terms — please check each source for compliance.


📚 References & Useful Links


If you find this repository useful, please consider giving it a star!

About

Data Science Tutorials

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published