Lydia Nishimwe lydianish

Hi, I'm Lydia Nishimwe 👋

I recently completed my PhD in Natural Language Processing (NLP) and am currently seeking a Research Scientist or Machine Learning Engineer position (on-site in Paris or remote). I specialize in robust neural machine translation, sentence embeddings, and NLP for user-generated content. I am particularly interested in working on generative AI, NLP for low-resource languages, and multimodality.

🛠️ Skills & Expertise

Machine Learning & NLP: Neural Machine Translation, Transformers, Large Language Models (LLMs), Sentence Embeddings, Lexical Normalization
Research & Development: Knowledge distillation, data augmentation, evaluation of NLP systems, multilingual NLP, large-scale distributed training on multiple GPUs/nodes
Programming & Tools: Python, PyTorch, Hugging Face Transformers, Fairseq, SLURM, Git, Linux, Docker, TypeScript

🚀 Featured Projects

These are among the repositories I’ve highlighted on my GitHub profile. Check them out for code, demos, and research results.

🔹 RoLASER

My PhD research on making LASER more robust to User‑Generated Content (UGC).
Includes robust sentence embeddings and UGC data generation via augmentation. The paper was published at the LREC-COLING 2024 conference, and the model is released on Hugging Face.

🔁 Forked and Contributed Projects (Used in Research)

These are important open‑source toolkits I’ve forked, used and contributed to during my PhD research:

📌 fairseq

Forked from Meta’s sequence‑to‑sequence toolkit and used extensively for NMT experiments. Contributed bug fixes and enhancements (e.g., dictionary handling improvements). Read my blog post about the bug here.

📌 NL‑Augmenter

Forked the transformation library, used it to generate artificial UGC for data augmentation, and contributed bug fixes and new features.

📌 LASER

Forked from the original LASER (Language‑Agnostic SEntence Representations) for use and extension in my research. Wrote scripts to evaluate new models (RoLASER) and new tasks (Massive Text Embedding Benchmark, MTEB).

📌 SONAR

Forked the SONAR repository to use its text embedding and translation capabilities in my work (text‑only use case, not speech). I used it to extend the RoLASER approach to RoSONAR.

🌐 Prior Collaborative / Hackathon Projects

💻 MT Challenger Frontend Legacy

Built the front-end of a data augmentation tool for Machine Translation during the 3-day online Unbabel MT Half-Marathon 2021.

🤝 Social Relief

Collaborative project for transparent distribution of Covid-19 relief funds in Kenya. Focused on system implementation and workflow improvements.

🧬 BRAG

Developed BRAG (Biomedical RAkinG), a cross-platform tool that aggregates bibliographic data from sources like PubMed and Google Scholar to summarise researchers’ scientific output, including publications, citations, h-index, and optional graphical representations. This was a school project at Centrale Nantes.

📂 PhD Work Organization

All my PhD-related repositories are grouped under my GitHub organization: lydianish-phd.

Work in progress: still migrating repositories from my lab's private GitLab.

📫 Connect with Me

I am passionate about building robust NLP systems and contributing to open-source research. Feel free to explore my repositories and reach out for collaborations or opportunities!
