Skip to content
View donphi's full-sized avatar
:octocat:
πŸ˜΄β˜•πŸ’Šβ˜•πŸ˜΄... πŸ”
:octocat:
πŸ˜΄β˜•πŸ’Šβ˜•πŸ˜΄... πŸ”

Block or report donphi

Block user

Prevent this user from interacting with your repositories and sending you notifications. Learn more about blocking users.

You must be logged in to block users.

Maximum 250 characters. Please don't include any personal information such as legal names or email addresses. Markdown supported. This note will be visible to only you.
Report abuse

Contact GitHub support about this user’s behavior. Learn more about reporting abuse.

Report abuse
donphi/README.md

πŸš€ Research Engineer (AI, Large-Scale Data Systems)

Survivor of a combined double-lung and liver transplant (141st globally at Penn Medicine). This experience reshaped my priorities: I now build infrastructure and tooling that strengthen scientific reliability and accelerate discovery.

My background spans design, high-stakes operations, and 12 years of leadership in finance. Today, I focus on AI systems engineering: large-scale multimodal pipelines, reproducible ML workflows, and research-enabling infrastructure for biomedical science. I am completing an MSc in Artificial Intelligence & Digital Health while developing this high-throughput research system.

I thrive on hard technical problems, high ambiguity, and complex data flows. My focus is the system: building the foundations that make scientific and ML workflows robust, scalable, and trustworthy.


πŸ”­ Core Project: (MSc) End-to-End Scientific Data System

I am currently building a single, cohesive system to process, index, and validate the UK Biobank research corpus. It is composed of three primary layers:

1. The Ingestion Layer: 23-Stage Multimodal Pipeline

A GPU-aware orchestration pipeline (Prefect, RabbitMQ) to convert raw PDFs into structured JSON. It integrates Docling, Marker, LayoutLMv3, and PaddleOCR and is built with hardware-aware profiles for DGX B200 and RTX 6000-class GPUs.

2. The Knowledge Base Layer: Ontology & Indexing

A data engineering pipeline that builds a searchable knowledge base. It parses raw MONDO/HPO/EFO ontology files into a DuckDB graph, and a parallel-processing ETL script indexes semantically-chunked documents into Elasticsearch.

3. The Application Layer: HITL Alias Miner & Validation

An ML-driven Flask application that validates scientific features. It uses SapBERT embeddings to find and score potential aliases from the corpus, then presents them to a human reviewer. The result is a high-quality, verified feature_cards.jsonl dataset for downstream model training.


πŸ› οΈ Technical Focus & Tooling

Core Areas

  • Large-scale multimodal data systems
  • Distributed compute & GPU-optimised pipelines
  • Deterministic research workflows
  • Ontology-driven feature extraction
  • Human-in-the-Loop (HITL) validation pipelines
  • Reproducible ML infrastructure

Technologies

  • Python, R
  • PyTorch, vLLM, Hugging Face, Lifelines, SHAP
  • Docling, Marker, PaddleOCR, LayoutLMv3
  • FAISS, Elasticsearch, PostgreSQL, DuckDB
  • Polars, Pandas, NumPy
  • Prefect, RabbitMQ, Flask
  • Docker (GPU), CUDA 12.x, Linux
  • React, D3.js, CI/CD

🌱 Currently Deepening

  • Low-rank adaptation (LoRA) and domain-specific fine-tuning
  • Analyzing research on latent space dimensionality and its connection to model capabilities (e.g., language context vs. multi-modal world modeling).
  • ML systems design for biomedical automation
  • Advanced statistical modelling (Statistics II)

🌐 Website

https://donphi.work

πŸ“« Connect

LinkedIn: https://www.linkedin.com/in/donphi

Pinned Loading

  1. human-loop-hypothesis human-loop-hypothesis Public

    HITL Validation Pipeline for Feature Extraction

    Python

  2. dodo dodo Public

    UKB Phenotype Viewer (Radial + Tidy Trees)

    TypeScript

  3. draftmerch draftmerch Public

    Serverless Image Generator (AWS + OpenAI)

    Python

  4. gatehouse gatehouse Public

    A Python library that enforces code schema compliance in LLM-generated Python through error-driven feedback loops

    Python

  5. cv-resume-ats-latex cv-resume-ats-latex Public

    Dual-format LaTeX CV system β€” one designed version for humans, one ATS-optimised version for machines. Single source of truth, two outputs.

    TeX 1