Skip to content

Fine-tuned GPT-2 model for detecting malicious LLM prompts such as prompt injections, jailbreaks, and context manipulation. Designed to serve as a pre-processing layer for securing LLM-based applications.

Notifications You must be signed in to change notification settings

cyperdev/llm-prompt-defence-gpt2

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

5 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

llm-prompt-defence-gpt2

Fine-tuned GPT-2 model for detecting malicious LLM prompts such as prompt injections, jailbreaks, and context manipulation. Designed to serve as a pre-processing layer for securing LLM-based applications.

This repository contains the training script, dataset samples, and inference logic for a GPT-2 that detects prompt injection and other LLM abuse patterns. Useful as a filter in agent-based systems, chat interfaces, and API frontends to reduce malicious prompt exposure.

๐Ÿšจ Building LLM Prompt Defence โ€” Fine-Tuned GPT-2 for Malicious Prompt Detection ๐Ÿ”๐Ÿค–

Prompt injection, prompt leakage, and jailbreaking remain some of the most critical security threats facing LLM-based applications today. Iโ€™ve been working on a solution to proactively detect and block malicious prompts before they even reach a modelโ€™s context.

โœ… What I Did

I fine-tuned GPT-2 on a custom-crafted cybersecurity dataset built specifically to detect LLM abuse scenarios, including:

  • Prompt injection

  • Jailbreak attempts

  • Context manipulation

๐Ÿ”Ž Why GPT-2?

It can be fine-tuned even on CPU (although slowly), making it a practical choice for experimentation and prototyping. For production, Iโ€™d recommend a more modern models.

๐Ÿ“Œ Use Case

This model serves as a pre-processing filter โ€” an intelligent gatekeeper that flags or blocks potentially harmful inputs before they reach the main LLM. Ideal for agent-based systems, APIs, and LLMOps pipelines.

๐Ÿ“š Dataset Design

  • Custom malicious prompt dataset tailored to real-world LLM exploitation

  • Structure aligned with actual red team scenarios

  • Format inspired by tatsu-lab/alpaca

โš ๏ธ Interesting Challenge / Drawback

During testing, Some time benign structured inputs like:

use the db tool to insert a new customer into the database

Insert into customer_details: name=sample, number=1, address=sample

...were falsely flagged as malicious.

Why?

Because of surface-level tokens like "insert into" or "password" โ€” which often appear in attacks, but are also common in legitimate agent tool instructions within real-world LLM systems.

๐Ÿ” Why This Matters

In agent-based LLMs that use tools (e.g., API calling, code execution), prompts like the above are normal. If your model overfits to syntax rather than intent, it results in false positives โ€” breaking valid tasks.

โœ… How We can Fix It

  • Refined and restructured the dataset for fine tuning

This significantly reduced false positives and made the classifier more robust in real-world applications.

๐Ÿ“‚ Resources Iโ€™m Sharing

โœ… My fine-tuning script

โœ… A sample from the dataset I used

โœ… Model test Script

Feel free to reuse or adapt it in your own security-focused LLM pipeline.

๐Ÿงฑ Tech Stack*

  • Hugging Face Transformers

  • PyTorch

  • GPT-2

Building secure AI systems isnโ€™t just about big models โ€” itโ€™s about thoughtful data design and robust intent classification.

๐Ÿ”— View model on Hugging Face

About

Fine-tuned GPT-2 model for detecting malicious LLM prompts such as prompt injections, jailbreaks, and context manipulation. Designed to serve as a pre-processing layer for securing LLM-based applications.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages