A curated collection of data science projects, scripts, and analyses with a focus on bioinformatics.
This repository serves as a portfolio and resource for exploring computational biology, genomics, proteomics, and related fields through the lens of data science.
Data-Science/
│
├── datasets/ # Metadata or processed data (no large raw data)
├── notebooks/ # Jupyter notebooks for analysis and visualization
├── scripts/
│ ├── data_download/ # Automated data download & preprocessing scripts
│ ├── analysis/ # Python/R scripts for bioinformatics analyses
│ └── ml_models/ # Machine learning pipelines and model training
├── results/ # Output figures, reports, and tables
├── environment.yml # Conda environment specification
├── requirements.txt # Python dependencies (for pip users)
└── README.md # This file
- Python 3.8+
- conda (recommended) or pip

- Clone the repository:
    git clone https://github.com/prabathjayatissa/Data-Science.git
    cd Data-Science
- Set up the environment (using Conda, recommended):
    conda env create -f environment.yml
    conda activate bioinfo_ds_env # Or whatever you name your environment in the YAML file
- Alternatively, using pip:
    pip install -r requirements.txt
This repository works with the following types of bioinformatics data.
Due to size constraints, raw data is often not stored in Git.
| Dataset | Description | Source | Size | Link |
|---|---|---|---|---|
| TCGA-BRCA | RNA-Seq and clinical data for Breast Cancer | The Cancer Genome Atlas | ~2 GB | GDC Portal |
| Human Genome (GRCh38) | Reference genome sequence | GENCODE | ~1 GB | GENCODE |
| Example Microbiome | 16S rRNA sequencing data from a mock community | [Cite Source] | ~50 MB | |
Pro Tip:
- Use tables for clarity.
- Always provide a source and link (DOI if available).
- Mention the size to set expectations.
- For large datasets, provide scripts in scripts/data_download/ to automate fetching.
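As an illustration of what a script in scripts/data_download/ might look like, here is a minimal sketch using only the Python standard library. The function names `fetch` and `sha256_of` are hypothetical and are not taken from the repository's actual scripts; the sketch assumes the data source publishes a SHA-256 checksum, which may not be true for every dataset.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(url, dest, expected_sha256=None):
    """Download url to dest unless dest already exists; optionally verify it.

    Hypothetical helper: the real scripts in this repository may differ.
    """
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        tmp = dest.with_name(dest.name + ".part")
        with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
            shutil.copyfileobj(response, out)
        tmp.replace(dest)  # rename only after the download has completed
    if expected_sha256 is not None and sha256_of(dest) != expected_sha256:
        raise ValueError(f"Checksum mismatch for {dest}")
    return dest
```

Skipping files that already exist keeps repeated runs cheap, and writing to a temporary `.part` file first avoids leaving a truncated file under the final name if the download is interrupted.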
Run the provided scripts from the repository root to fetch data:
# Example: Download and preprocess TCGA data
bash scripts/data_download/fetch_tcga_data.sh
# Example: Download a reference genome
python scripts/data_download/get_reference_genome.py

Here are the key analyses and projects contained in this repository:
Description: Identifies significantly up/down-regulated genes between two biological conditions (e.g., tumor vs. normal) using RNA-Seq data.
Tools: DESeq2 (R) / limma-voom / Scanpy (for scRNA-Seq)
Key Output: Lists of DEGs, volcano plots, MA plots.
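To make the "significantly up/down-regulated" step concrete, the sketch below shows one common way to turn per-gene p-values and log2 fold changes into DEG labels, the same labels a volcano plot would colour by. It is not DESeq2 or limma-voom: it is a plain-Python Benjamini-Hochberg adjustment plus a threshold rule, and the function names and the cutoffs (`fc_cutoff=1.0`, `alpha=0.05`) are illustrative assumptions rather than values taken from this repository.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def classify_gene(log2_fc, padj, fc_cutoff=1.0, alpha=0.05):
    """Label a gene 'up', 'down', or 'ns' (not significant).

    The cutoffs are illustrative defaults, not values used by this repository.
    """
    if padj < alpha and log2_fc >= fc_cutoff:
        return "up"
    if padj < alpha and log2_fc <= -fc_cutoff:
        return "down"
    return "ns"


# Toy input: hypothetical gene names with made-up statistics.
genes = ["geneA", "geneB", "geneC", "geneD"]
log2_fcs = [2.3, -1.8, 0.2, 1.5]
pvalues = [0.001, 0.004, 0.6, 0.2]
padj = benjamini_hochberg(pvalues)
labels = {g: classify_gene(fc, q) for g, fc, q in zip(genes, log2_fcs, padj)}
```

In a real analysis the p-values and fold changes would come from the statistical model (for example, a DESeq2 results table), which also performs this multiple-testing adjustment itself.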
Description: A Snakemake/Nextflow pipeline for calling genetic variants from NGS data (e.g., WES, WGS).
Tools: FastQC, BWA-MEM, GATK, Samtools, Snakemake
Key Output: VCF files containing SNP and Indel calls.
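As a small illustration of working with the pipeline's output, the following sketch counts SNP and indel calls in uncompressed VCF text using only the standard library. It assumes the standard tab-separated VCF column order (REF in column 4, ALT in column 5); the function names are hypothetical, and for real work a dedicated parser such as Pysam would be the more robust choice.

```python
from collections import Counter


def classify_variant(ref, alt):
    """Label one REF/ALT pair as 'SNP', 'indel', or 'other'."""
    # Symbolic alleles (<DEL>), spanning deletions (*) and missing ALT (.)
    # are not simple sequence changes, so they are grouped as 'other'.
    if alt.startswith("<") or alt in {"*", "."}:
        return "other"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "indel"
    return "other"  # e.g. multi-nucleotide substitutions


def count_variant_types(vcf_lines):
    """Count variant types over an iterable of VCF lines (headers are skipped)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]
        # Multi-allelic sites list several ALT alleles separated by commas.
        for alt in alts.split(","):
            counts[classify_variant(ref, alt)] += 1
    return counts
```

Usage might look like `count_variant_types(open("calls.vcf"))`, where `calls.vcf` is a placeholder file name.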
Description: Predicts protein function from amino acid sequences using machine learning.
Tools: scikit-learn, Biopython, PyTorch
Features: Amino acid composition, PSSM profiles, protein embeddings (ESM, etc.)
Models: Random Forest, XGBoost, Simple Neural Network.
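Of the features listed above, amino acid composition is the simplest to compute. The sketch below turns a protein sequence into a 20-dimensional vector of residue frequencies that could be passed to a classifier such as a Random Forest. The function name `aa_composition` and the choice to ignore non-standard residues (such as X or U) are assumptions made for illustration, not necessarily how this repository computes the feature.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature columns line up.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Non-standard characters (e.g. X, U, *) are ignored, which is an
    assumption of this sketch.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Example: a short, made-up peptide.
features = aa_composition("MKTAYIAKQR")
```

Each sequence yields one row of a feature matrix; stacking the rows for all proteins gives an input suitable for scikit-learn style estimators.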
- Python
- R
- Bash
- Biopython, Pysam, PyVCF
- DESeq2, limma, Bioconductor (R)
- Scanpy, Scikit-bio
- pandas, numpy, scikit-learn, XGBoost
- matplotlib, seaborn, plotly
- Snakemake, Nextflow
- Conda, Docker
Contributions are welcome!
If you have a new analysis, script, or suggestion for improvement, please:
- Fork the repo
- Create a feature branch:
    git checkout -b feature/AmazingFeature
- Commit your changes:
    git commit -m "Add some AmazingFeature"
- Push to your branch:
    git push origin feature/AmazingFeature
- Open a Pull Request
Please ensure your code is well-documented and follows the existing style.
This project is licensed under the MIT License — see the LICENSE file for details.
Note: The code in this repository is licensed under MIT.
However, datasets may have their own licensing terms — please check each source for compliance.
- Biostars Handbook — A fantastic resource for bioinformatics.
- ROSALIND — A platform for learning bioinformatics through problem-solving.
- GATK Best Practices — For variant discovery pipelines.
- GEO Database — Functional genomics data repository.
⭐ If you find this repository useful, please consider giving it a star!