
Hi, I'm Lydia Nishimwe 👋

I recently completed my PhD in Natural Language Processing (NLP) and am currently seeking a Research Scientist or Machine Learning Engineer position (on-site in Paris or remote). I specialize in robust neural machine translation, sentence embeddings, and NLP for user-generated content. I am particularly interested in working on generative AI, NLP for low-resource languages, and multimodality.


🛠️ Skills & Expertise

  • Machine Learning & NLP: Neural Machine Translation, Transformers, Large Language Models (LLMs), Sentence Embeddings, Lexical Normalization
  • Research & Development: Knowledge distillation, data augmentation, evaluation of NLP systems, multilingual NLP, large-scale distributed training on multiple GPUs/nodes
  • Programming & Tools: Python, PyTorch, Hugging Face Transformers, Fairseq, SLURM, Git, Linux, Docker, TypeScript

🚀 Featured Projects

These are among the repositories I’ve highlighted on my GitHub profile — check them out for code, demos, and research results.

🔹 RoLASER

My PhD research work to make LASER more robust to User‑Generated Content (UGC).
Includes robust sentence embeddings and UGC data generation via augmentation. Paper published at LREC-COLING 2024 conference; model released on Hugging Face.


🔁 Forked and Contributed Projects (Used in Research)

These are important open‑source toolkits I’ve forked, used and contributed to during my PhD research:

📌 fairseq

Forked from Meta’s sequence‑to‑sequence toolkit and used extensively for NMT experiments. Contributed bug fixes and enhancements (e.g., dictionary handling improvements), and wrote a blog post about the bug.

📌 NL-Augmenter

Forked this collaborative library of natural language transformations, used it to generate artificial UGC for data augmentation, and contributed bug fixes and new features.
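To give a flavour of what generating artificial UGC can look like, here is a minimal sketch of one text transformation. It is an illustration only: the `to_ugc` function and the `ABBREVIATIONS` table are hypothetical names made up for this example, and they are not part of NL-Augmenter's API or of the RoLASER data pipeline.

```python
import random

# Hypothetical abbreviation table for illustration; not taken from NL-Augmenter.
ABBREVIATIONS = {"you": "u", "are": "r", "please": "plz", "tomorrow": "2moro"}


def to_ugc(sentence: str, rate: float = 0.5, seed: int = 0) -> str:
    """Replace known words with informal abbreviations with probability `rate`."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    noisy = []
    for word in sentence.split():
        key = word.lower()
        if key in ABBREVIATIONS and rng.random() < rate:
            noisy.append(ABBREVIATIONS[key])
        else:
            noisy.append(word)
    return " ".join(noisy)


if __name__ == "__main__":
    # With rate=1.0 every word found in the table is abbreviated.
    print(to_ugc("are you free tomorrow", rate=1.0))
```

Pairing each standard sentence with a noisy variant like this is one simple way to build augmented training data; real UGC phenomena (typos, slang, elongations) would need a richer set of transformations.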

📌 LASER

Forked from the original LASER (Language‑Agnostic SEntence Representations) for use and extension in my research. Wrote scripts to evaluate new models (RoLASER) and new tasks (Massive Text Embedding Benchmark, MTEB).

📌 SONAR

Forked the SONAR repository to use its text embedding and translation capabilities in my work (text‑only use case, not speech). Used it to extend the RoLASER approach to RoSONAR.


🌐 Prior Collaborative / Hackathon Projects

Built the front-end of a data augmentation tool for machine translation (mt_challenger_frontend_legacy) during the 3-day online Unbabel MT Half-Marathon 2021.

Contributed to social-relief, a collaborative project for the transparent distribution of Covid-19 relief funds in Kenya. Focused on system implementation and workflow improvements.

🧬 BRAG

Developed BRAG (Biomedical RAnkinG), a cross-platform tool that aggregates bibliographic data from sources like PubMed and Google Scholar to summarise researchers’ scientific output, including publications, citations, h-index, and optional graphical representations. This was a school project at Centrale Nantes.
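For readers unfamiliar with the h-index mentioned above, the following is a minimal sketch of the standard definition: the largest h such that h publications each have at least h citations. The `h_index` function is a hypothetical helper written for this illustration, not code from BRAG (which is a JavaScript project), and the citation counts are made-up example values.

```python
def h_index(citations: list[int]) -> int:
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most-cited paper first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have >= `rank` citations
        else:
            break
    return h


if __name__ == "__main__":
    # Example values only: four papers have at least 4 citations each.
    print(h_index([10, 8, 5, 4, 3]))  # prints 4
```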


📂 PhD Work Organization

All my PhD-related repositories are grouped under my GitHub organization: lydianish-phd.

Work in progress: still migrating repositories from my lab's private GitLab.


📫 Connect with Me


I am passionate about building robust NLP systems and contributing to open-source research. Feel free to explore my repositories and reach out for collaborations or opportunities!

Pinned

  1. lydianish-phd/RoLASER Public

    A Robust LASER sentence encoder for English User-Generated Content

    Python 2

  2. fairseq Public

    Forked from facebookresearch/fairseq

    Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

    Python

  3. NL-Augmenter Public

    Forked from GEM-benchmark/NL-Augmenter

    NL-Augmenter 🦎 → 🐍 A Collaborative Repository of Natural Language Transformations

    Python

  4. alphamanuscript/social-relief Public

    Transparent and reliable distribution of funds from individual donors to people affected by Covid-19 in Kenya

    TypeScript 1

  5. mt-challenge-generator/mt_challenger_frontend_legacy Public

    Vue 1

  6. brag Public

    A tool for Biomedical Ranking

    JavaScript 1