This project focuses on the security issues of Large Language Models (LLMs) in the context of malware-related tasks and jailbreak attacks. With the widespread adoption of LLMs, concerns about their security, especially vulnerability to jailbreak attacks that can lead to malicious outputs, have grown. While previous research has looked at general LLM security, the specific susceptibility to jailbreak attacks in code generation has not been fully explored.
- Propose MalwareBench, a benchmark dataset designed to evaluate the robustness of LLMs against jailbreak attacks for malicious code generation.
- Test 29 mainstream LLMs to analyze their security performance and influencing factors when faced with jailbreak attacks.
- Reveal the security vulnerabilities of current LLMs and provide important directions for subsequent improvements in model security.
- Composition: MalwareBench contains 3,520 jailbreaking prompts for malicious code generation. It is based on 320 manually crafted malicious code generation requirements, covering 11 jailbreak methods and 29 code functionality categories.
- Categorization: When constructing the dataset, researchers referred to the malimg dataset. Malicious problems are categorized into 6 primary classifications based on user intent, with further secondary and tertiary classifications for some categories. For each detailed category, 5 - 20 malicious requirements were manually crafted, considering different operating systems and dividing requirements into rough and detailed types.
- Jailbreak Methods: 11 jailbreak methods of three types were selected, with Qwen-Turbo used in some for question generation.
- Rejection Rates: Initially, code-generation models had a 70.56% rejection rate and generic large models had 51.19% for malicious content. However, when jailbreak methods were applied, these rates dropped to 51.50% and 41.47% respectively. The overall average rejection rate for malicious content is 60.93%, which drops to 39.92% when combined with jailbreak attack algorithms.
- Model Performance:
- There is a negative correlation between response scores and refusal rates.
- About 50.35% of jailbreak attempts yield malicious content. Models like OpenAI-o1 and CodeLlama-70B-Instruct show strong security.
- LLMs are more defensive against detailed requirements. Small parameter models tend to give irrelevant outputs, and large models are more prone to generating malicious pseudo-code, often relying on their knowledge bases.
- Advanced reasoning models such as OpenAI-o1 and DeepSeek-R1 can handle malicious requests, but security alignment is crucial.
- Different models have varied sensitivities to attack algorithms, with Benign Expression being a highly effective jailbreaking method.
- Regarding requirement types, LLMs show a consistent trend, with low scores for Denial Service and Download&Propagation, and high scores for Information Theft, possibly due to training data or model mechanisms.
- Code: MAIL-Tele-AI/MalwareBench
- Huggingface: Tele-AI-MAIL/MalwareBench
- Original Dataset: original_questions.xlsx
- Attack Prompts: attack_prompts.xlsx
The test results of LLMs are located in the test_result folder.
@article{*****,
title={LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges},
author={Li, Haoyang and Gao, Huan and Zhao, Zhiyuan and Lin, Zhiyu and Gao, Junyu and Li, Xuelong},
conference={Proceedings of the ACL 2025 Conference},
year={2025}
}


