Single cells. Whole genomes. Drug targets. Clinical records. From raw data to decisions — end to end.
I am a bioinformatician, data scientist, and statistical modeler with 4+ years of experience. I build things that scale, run analyses that hold up under scrutiny, and communicate findings that people actually understand and use.
The data I work with: single-cell RNA-seq · bulk RNA-seq · whole genome & exome · proteomics · HTS compound screens · clinical records · multi-omics
The problems I solve: disease subtype classification · drug target identification · somatic variant detection · pathway dysregulation · compound hit prioritization · pipeline bottlenecks
The standard I hold: reproducible · statistically rigorous · biologically interpretable · production-ready
Raw Data ──▶ Insight
FASTQ files ──▶ QC → Alignment → Variant Calling → Annotation
scRNA-seq counts ──▶ Normalization → Clustering → Cell Type ID → DEGs
Mass spec outputs ──▶ LFQ → Protein quantification → Multi-omics fusion
HTS screening data ──▶ Plate QC → Dose-response → IC50 → Hit ID
Clinical records ──▶ Cleaning → Modeling → Statistical reports
All of the above ──▶ Machine learning → Prediction → SHAP → Publication
Building end-to-end NGS pipelines that actually run in production. Snakemake workflows on SLURM HPC and AWS Batch. scRNA-seq with Seurat and Scanpy. Somatic variant calling with GATK Mutect2. Differential expression with DESeq2. 30% faster. 26% more throughput. Not by accident — by design. Tools: Seurat · Scanpy · DESeq2 · STAR · BWA-MEM · GATK · Mutect2 · samtools · bcftools · Snakemake · BaseSpace
Machine learning on biological data — where the features are genes, the labels are disease subtypes, and the stakes are real. Random Forest classifier at AUC 0.85. SHAP-based feature importance. 5-fold cross-validation. UMAP and PCA for dimensionality reduction. Models that generalize, not just memorize. Tools: R · Python · randomForest · caret · xgboost · SHAPforxgboost · scikit-learn · pROC
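An AUC like the 0.85 above has a concrete meaning: the probability that a randomly chosen positive case is ranked above a randomly chosen negative one. A minimal pure-Python sketch of that rank-based definition (the scores below are toy values, not model output):

```python
def auc(scores_pos, scores_neg):
    """Rank-based AUC: P(score_pos > score_neg); ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy predicted probabilities, for illustration only
pos = [0.9, 0.8, 0.7, 0.35]
neg = [0.6, 0.4, 0.3, 0.2]
print(auc(pos, neg))  # 0.875
```

In practice this is what `pROC` or scikit-learn's ROC utilities compute; the pairwise form just makes the interpretation explicit.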
Statistics as the backbone of every analysis — not an afterthought. FDR correction on differential expression. LASSO for feature selection. Multivariate regression on clinical datasets. Proper experimental design. Cross-validation that means something. Results rigorous enough to publish and clear enough for a clinician to act on. Tools: R (primary) · DESeq2 · LASSO · logistic regression · PCA · FDR · Biostatistics
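The FDR correction mentioned above is the Benjamini–Hochberg procedure — the same adjustment behind DESeq2's adjusted p-values. A minimal Python sketch, with made-up p-values:

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (FDR q-values)."""
    m = len(pvals)
    # Sort indices by p-value, smallest first
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    prev = 1.0
    # Walk from the largest rank down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        adj[i] = prev
    return adj

pvals = [0.001, 0.008, 0.039, 0.041, 0.20]
print(bh_adjust(pvals))
```

Genes with q below the chosen FDR threshold (commonly 0.05) are called significant; note how the two near-tied raw p-values (0.039, 0.041) collapse to the same adjusted value.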
Raw data is almost never clean. Clinical records, formulation datasets, structured databases — every dataset has problems. SQL for querying and validation. Python scripts for integrity checks. QC protocols that catch errors before they propagate. Normalization, transformation, and reporting that R&D teams can trust. Tools: SQL · Python · R · pandas · dplyr · ggplot2 · tidyr · Excel |
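The SQL-based validation described above can be sketched with Python's built-in sqlite3 module — the table, columns, and checks here are hypothetical stand-ins for a real clinical database:

```python
import sqlite3

# In-memory demo table standing in for a clinical records database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE patients (id INTEGER, age REAL, sex TEXT)")
con.executemany("INSERT INTO patients VALUES (?, ?, ?)",
                [(1, 54, "F"), (2, None, "M"), (3, 210, "F"), (2, 61, "M")])

# Each check should return 0 rows on a clean dataset
checks = {
    "missing_age": "SELECT COUNT(*) FROM patients WHERE age IS NULL",
    "age_out_of_range":
        "SELECT COUNT(*) FROM patients WHERE age NOT BETWEEN 0 AND 120",
    "duplicate_ids":
        "SELECT COUNT(*) FROM "
        "(SELECT id FROM patients GROUP BY id HAVING COUNT(*) > 1)",
}

# Flag any nonzero count before analysis proceeds
for name, sql in checks.items():
    n = con.execute(sql).fetchone()[0]
    print(f"{name}: {n} row(s)")
```

The point is the pattern, not the particular rules: codify every assumption about the data as a query, and fail loudly before errors propagate downstream.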
Single-cell RNA-seq pipeline + Machine Learning classifier for HPV+ cancer cohorts
A complete analysis workflow — from raw count matrices to a validated Random Forest subtype classifier. Built in R using Seurat for preprocessing, DESeq2 for differential expression, and SHAP for model interpretability.
QC & Filtering → Normalization → UMAP Clustering → FDR-controlled DEGs
→ Feature Matrix (top 200 DEGs + TMB scores)
→ Random Forest (5-fold CV) → AUC 0.85 → SHAP Feature Importance
R Seurat DESeq2 randomForest SHAP pROC ggplot2
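The 5-fold cross-validation step in the workflow above can be sketched without any ML library — a stratified splitter that deals each class round-robin across folds so subtype proportions stay balanced (the labels below are a toy cohort, not project data):

```python
def stratified_kfold(labels, k=5):
    """Yield (train_idx, test_idx) pairs with per-class balance:
    each class's samples are dealt round-robin across the k folds."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for t in range(k):
        test = sorted(folds[t])
        train = sorted(i for f in range(k) if f != t for i in folds[f])
        yield train, test

labels = ["HPV+"] * 10 + ["HPV-"] * 10   # toy labels
for train, test in stratified_kfold(labels, k=5):
    n_pos = sum(labels[i] == "HPV+" for i in test)
    print(len(train), len(test), n_pos)  # 16 4 2 for every fold
```

This mirrors what `caret::createFolds` or scikit-learn's `StratifiedKFold` do; the stratification matters because an unbalanced fold can make a fold-level AUC meaningless.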
Production-grade NGS pipeline: somatic variant detection + transcriptomic analysis
A Snakemake workflow covering the full NGS stack — QC, alignment, duplicate marking, base recalibration, Mutect2 variant calling, and DESeq2 differential expression. Deployable on SLURM HPC and AWS Batch. Achieves 30% runtime reduction through parallelization and optimized resource allocation.
FastQC → Trimmomatic → BWA-MEM → MarkDuplicates → BQSR
→ Mutect2 → FilterMutectCalls → bcftools merge
→ featureCounts → DESeq2 → MultiQC report
Snakemake GATK4 BWA-MEM Mutect2 DESeq2 AWS SLURM Docker
HTS compound screening pipeline for proximity-based drug discovery (PROTACs, molecular glues)
An R-based pipeline for processing high-throughput screening data — plate QC (Z' factor, SSMD), 4-parameter log-logistic dose-response curve fitting, IC50 estimation with 95% CI, hit identification, and downstream integration with transcriptomic and proteomic data for multi-omics candidate prioritization.
Plate QC (Z' factor) → 4PL Dose-Response Fitting → IC50 Estimation
→ Hit Identification → Transcriptomic + Proteomic Integration
→ Composite Prioritization Score → Ranked Candidate List
R drc ggplot2 DESeq2 LFQ proteomics pheatmap
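Two quantities in the pipeline above are easy to make concrete: the Z' factor, computed from positive/negative control wells, and the 4PL response curve, whose inflection parameter is the IC50 when the asymptotes span 0–100% response. A minimal Python sketch with made-up plate data (the actual pipeline fits curves with R's `drc`):

```python
import statistics as st

def z_prime(pos, neg):
    """Plate-quality Z' factor from control wells.
    Z' = 1 - 3*(sd_pos + sd_neg) / |mean_pos - mean_neg|;
    values above ~0.5 indicate a robust assay."""
    return 1 - 3 * (st.stdev(pos) + st.stdev(neg)) / abs(st.mean(pos) - st.mean(neg))

def four_pl(x, bottom, top, ic50, hill):
    """4-parameter log-logistic response at concentration x."""
    return bottom + (top - bottom) / (1 + (x / ic50) ** hill)

# Toy control wells (arbitrary signal units)
pos_ctrl = [95, 98, 102, 101, 99]
neg_ctrl = [5, 6, 4, 7, 5]
print(round(z_prime(pos_ctrl, neg_ctrl), 3))

# At x = ic50 the 4PL response is exactly midway between the asymptotes
mid = four_pl(2.0, bottom=0.0, top=100.0, ic50=2.0, hill=1.2)
print(mid)  # 50.0
```

Confidence intervals on the fitted IC50 come from the curve fit itself (e.g. `drc`'s delta-method standard errors), which is why the pipeline reports IC50 with 95% CI rather than a point estimate alone.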
languages <- c("R (primary)", "Python", "SQL", "Bash", "Linux")
omics <- c("scRNA-seq", "Bulk RNA-seq", "WGS", "WES",
"Proteomics (MS/LFQ)", "Somatic Variants",
"TMB", "CNV", "Clinical Data", "HTS Screening")
ml_stats <- c("Random Forest", "LASSO", "Logistic Regression",
"PCA", "UMAP", "SHAP", "Cross-Validation",
"FDR Correction", "Differential Expression",
"Multivariate Regression", "Dimensionality Reduction")
bio_tools <- c("Seurat", "Scanpy", "DESeq2", "STAR", "BWA-MEM",
"GATK4", "Mutect2", "samtools", "bcftools",
"IGV", "Snakemake", "BaseSpace", "featureCounts")
infra <- c("SLURM HPC", "AWS S3", "AWS EC2", "AWS Batch",
           "Docker", "Git", "GitHub")

| Metric | Value |
|---|---|
| Years of experience | 4+ |
| Random Forest classifier AUC | 0.85 |
| Pipeline throughput improvement | +26% |
| Pipeline runtime reduction | -30% |
| Omics data types worked with | 8+ |
| Programming languages | 4 |
| GPA (M.S. Bioinformatics) | 3.8 / 4.0 |
🎓 M.S. Bioinformatics Data Science · University of Delaware · GPA 3.8 · Expected May 2026
Coursework: Systems Biology · Biostatistics · Machine Learning · Bioinformatics · Electronic Health Records
🎓 B.Pharm · Dr. Babasaheb Ambedkar Technological University · 2022
roles = [
"Bioinformatics Analyst",
"Senior Bioinformatics Analyst",
"Computational Biologist",
"Data Scientist (Biotech / Pharma / Healthtech)",
"Statistical Analyst / Biostatistician",
"Clinical Data Analyst"
]
domains = [
"Drug Discovery",
"Oncology",
"Rare Disease",
"Genomics",
"Multi-Omics",
"Clinical Research"
]

If your team works in any of these spaces — let's talk.
📧 shivanip8369@gmail.com · LinkedIn
"Four years of working with data that does not cooperate has a way of building a very specific skill set."