I recently completed my PhD in Natural Language Processing (NLP) and am currently seeking a Research Scientist or Machine Learning Engineer position (on-site in Paris or remote). I specialize in robust neural machine translation, sentence embeddings, and NLP for user-generated content. I am particularly interested in working on generative AI, NLP for low-resource languages, and multimodality.
- Machine Learning & NLP: Neural Machine Translation, Transformers, Large Language Models (LLMs), Sentence Embeddings, Lexical Normalization
- Research & Development: Knowledge distillation, data augmentation, evaluation of NLP systems, multilingual NLP, large-scale distributed training on multiple GPUs/nodes
- Programming & Tools: Python, PyTorch, Hugging Face Transformers, Fairseq, SLURM, Git, Linux, Docker, TypeScript
These are among the repositories I’ve highlighted on my GitHub profile. Check them out for code, demos, and research results.
🔹 RoLASER
My PhD research on making LASER more robust to User‑Generated Content (UGC).
Includes robust sentence embeddings and UGC data generation via augmentation.
Paper published at the LREC-COLING 2024 conference; model released on Hugging Face.
These are important open‑source toolkits I’ve forked, used and contributed to during my PhD research:
📌 fairseq
Forked from Meta’s sequence‑to‑sequence toolkit and used extensively for NMT experiments. Contributed bug fixes and enhancements (e.g., dictionary handling improvements). Read my blog post about the bug here.
Forked the transformation library, used it to generate artificial UGC for data augmentation, and contributed bug fixes and new features.
📌 LASER
Forked from the original LASER (Language‑Agnostic SEntence Representations) for use and extension in my research. Wrote evaluation scripts for new models (RoLASER) and new tasks (the Massive Text Embedding Benchmark, MTEB).
📌 SONAR
Forked the SONAR repository to use its text embedding and translation capabilities in my work (text‑only use case, not speech). Used it to extend the RoLASER approach to RoSONAR.
Built the front-end of a data augmentation tool for Machine Translation during the 3-day online Unbabel MT Half-Marathon 2021.
Collaborative project for the transparent distribution of COVID-19 relief funds in Kenya. Focused on system implementation and workflow improvements.
🧬 BRAG
Developed BRAG (Biomedical RAkinG), a cross-platform tool that aggregates bibliographic data from sources like PubMed and Google Scholar to summarise researchers’ scientific output, including publications, citations, h-index, and optional graphical representations. This was a school project at Centrale Nantes.
All my PhD-related repositories are grouped under my GitHub organization: lydianish-phd.
Work in progress: still migrating repositories from my lab's private GitLab.
I am passionate about building robust NLP systems and contributing to open-source research. Feel free to explore my repositories and reach out for collaborations or opportunities!
