Convert Figma logo to code with AI

yueliu1999 logoAwesome-Jailbreak-on-LLMs

Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods on LLMs. It contains papers, codes, datasets, evaluations, and analyses.

1,269
103
1,269
0

Quick Overview

Error generating quick overview

Convert Figma logo designs to code with AI

Visual Copilot

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot

README

Awesome-Jailbreak-on-LLMs

Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods on LLMs. It contains papers, codes, datasets, evaluations, and analyses. Any additional things regarding jailbreak, PRs, issues are welcome and we are glad to add you to the contributor list here. Any problems, please contact yliu@u.nus.edu. If you find this repository useful to your research or work, it is really appreciated to star this repository and cite our papers here. :sparkles:

Reference

If you find this repository helpful for your research, we would greatly appreciate it if you could cite our papers. :sparkles:

@article{zhuzhenhao_GuardReasoner_Omni,
  title={GuardReasoner-Omni: A Reasoning-based Multi-modal Guardrail for Text, Image, and Video},
  author={Zhu, Zhenhao and Liu, Yue and Guo, Yanpei and Qu, Wenjie and Chen, Cancan and He, Yufei and Li, Yibo and Chen, Yulin and Wu, Tianyi and Xu, Huiying and others},
  journal={arXiv preprint arXiv:2602.03328},
  year={2026}
}

@article{liuyue_GuardReasoner_VL,
  title={GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning},
  author={Liu, Yue and Zhai, Shengfang and Du, Mingzhe and Chen, Yulin and Cao, Tri and Gao, Hongcheng and Wang, Cheng and Li, Xinfeng and Wang, Kun and Fang, Junfeng and Zhang, Jiaheng and Hooi, Bryan},
  journal={arXiv preprint arXiv:2505.11049},
  year={2025}
}

@article{liuyue_GuardReasoner,
  title={GuardReasoner: Towards Reasoning-based LLM Safeguards},
  author={Liu, Yue and Gao, Hongcheng and Zhai, Shengfang and Jun, Xia and Wu, Tianyi and Xue, Zhiwei and Chen, Yulin and Kawaguchi, Kenji and Zhang, Jiaheng and Hooi, Bryan},
  journal={arXiv preprint arXiv:2501.18492},
  year={2025}
}

@article{liuyue_FlipAttack,
  title={FlipAttack: Jailbreak LLMs via Flipping},
  author={Liu, Yue and He, Xiaoxin and Xiong, Miao and Fu, Jinlan and Deng, Shumin and Hooi, Bryan},
  journal={arXiv preprint arXiv:2410.02832},
  year={2024}
}

@article{wang2025safety,
  title={Safety in Large Reasoning Models: A Survey},
  author={Wang, Cheng and Liu, Yue and Li, Baolong and Zhang, Duzhen and Li, Zhongzhi and Fang, Junfeng},
  journal={arXiv preprint arXiv:2504.17704},
  year={2025}
}

Bookmarks

Papers

Jailbreak Attack

Attack on LRMs

TimeTitleVenuePaperCode
2025.11BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language ModelsAAAI'26link-
2025.08Jinx: Unlimited LLMs for Probing Alignment FailuresarXivlinkmodels
2025.07BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or ProfitarXivlinklink
2025.06ExtendAttack: Attacking Servers of LRMs via Extending ReasoningAAAI'26linklink
2025.06Excessive Reasoning Attack on Reasoning LLMsarXivlink-
2025.03Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning ModelsarXivlink-
2025.02OverThink: Slowdown Attacks on Reasoning LLMsarXivlinklink
2025.02BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor AttackarXivlinklink
2025.02H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash ThinkingarXivlinklink
2025.02A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative ChaosarXivlink-

Black-box Attack

TimeTitleVenuePaperCode
2025.10BreakFun: Jailbreaking LLMs via Schema ExploitationarXivlink-
2025.07Response Attack: Exploiting Contextual Priming to Jailbreak Large Language ModelsarXivlinklink
2025.05Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM DetectionICML'25linklink
2025.05FlipAttack: Jailbreak LLMs via Flipping (FlipAttack)ICML'25linklink
2025.03Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy (JOOD)CVPR'25linklink
2025.02StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language ModelsarXivlinklink
2025.01Understanding and Enhancing the Transferability of Jailbreaking AttacksICLR'25linklink
2024.11The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language ModelsarXivlinklink
2024.11Playing Language Game with LLMs Leads to JailbreakingarXivlinklink
2024.11GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs (GASP)arXivlinklink
2024.11LLM STINGER: Jailbreaking LLMs using RL fine-tuned LLMsarXivlink-
2024.11SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential PromptarXivlinklink
2024.11Diversity Helps Jailbreak Large Language ModelsarXivlink-
2024.11Plentiful Jailbreaks with String CompositionsarXivlink-
2024.11Transferable Ensemble Black-box Jailbreak Attacks on Large Language ModelsarXivlinklink
2024.11Stealthy Jailbreak Attacks on Large Language Models via Benign Data MirroringarXivlink-
2024.10Endless Jailbreaks with BijectionarXivlink-
2024.10Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language ModelsarXivlink-
2024.10You Know What I'm Saying: Jailbreak Attack via Implicit ReferencearXivlinklink
2024.10Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt TranslationarXivlinklink
2024.10AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs (AutoDAN-Turbo)arXivlinklink
2024.10PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach (PathSeeker)arXivlink-
2024.10Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask ProfanityarXivlinklink
2024.09AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMsarXivlinklink
2024.09Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMsarXivlink-
2024.09Jailbreaking Large Language Models with Symbolic MathematicsarXivlink-
2024.08Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit CluesACL Findings'24linklink
2024.08Advancing Adversarial Suffix Transfer Learning on Aligned Large Language ModelsarXivlink-
2024.08Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier ArticlesarXivlink-
2024.08h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment (h4rm3l)arXivlinklink
2024.08EnJa: Ensemble Jailbreak on Large Language Models (EnJa)arXivlink-
2024.07Knowledge-to-Jailbreak: One Knowledge Point Worth One AttackarXivlinklink
2024.07LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language ModelsarXivlink
2024.07Single Character Perturbations Break LLM AlignmentarXivlinklink
2024.07A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI ResponsesarXivlink-
2024.07Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection (Virtual Context)arXivlink-
2024.07SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack (SoP)arXivlinklink
2024.06Jailbreaking as a Reward Misspecification ProblemICLR'25linklink
2024.06Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (I-FSJ)NeurIPS'24linklink
2024.06When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search (RLbreaker)NeurIPS'24link-
2024.06Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast (Agent Smith)ICML'24linklink
2024.06Covert Malicious Finetuning: Challenges in Safeguarding LLM AdaptationICML'24link-
2024.06ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (ArtPrompt)ACL'24linklink
2024.06From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings (ASETF)arXivlink-
2024.06CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion (CodeAttack)ACL'24link-
2024.06Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction (DRA)USENIX Security'24linklink
2024.06AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens (AutoJailbreak)arXivlink-
2024.06Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive AttacksarXivlinklink
2024.06GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts (GPTFUZZER)arXivlinklink
2024.06A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily (ReNeLLM)NAACL'24linklink
2024.06QROA: A Black-Box Query-Response Optimization Attack on LLMs (QROA)arXivlinklink
2024.06Poisoned LangChain: Jailbreak LLMs by LangChain (PLC)arXivlinklink
2024.05Multilingual Jailbreak Challenges in Large Language ModelsICLR'24linklink
2024.05DeepInception: Hypnotize Large Language Model to Be Jailbreaker (DeepInception)EMNLP'24linklink
2024.05GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation (IRIS)ACL'24link-
2024.05GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of LLMs (GUARD)arXivlink-
2024.05"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models (DAN)CCS'24linklink
2024.05Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher (SelfCipher)ICLR'24linklink
2024.05Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters (JAM)NeurIPS'24link-
2024.05Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (ICA)arXivlink-
2024.04Many-shot jailbreaking (MSJ)NeurIPS'24 Anthropiclink-
2024.04PANDORA: Detailed LLM jailbreaking via collaborated phishing agents with decomposed reasoning (PANDORA)ICLR Workshop'24link-
2024.04Fuzzllm: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models (FuzzLLM)ICASSP'24linklink
2024.04Sandwich attack: Multi-language mixture adaptive attack on llms (Sandwich attack)TrustNLP'24link-
2024.03Tastle: Distract large language models for automatic jailbreak attack (TASTLE)arXivlink-
2024.03DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers (DrAttack)EMNLP'24linklink
2024.02PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails (PRP)arXivlink-
2024.02CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models (CodeChameleon)arXivlinklink
2024.02PAL: Proxy-Guided Black-Box Attack on Large Language Models (PAL)arXivlinklink
2024.02Jailbreaking Proprietary Large Language Models using Word Substitution CipherarXivlink-
2024.02Query-Based Adversarial Prompt GenerationarXivlink-
2024.02Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks (Contextual Interaction Attack)arXivlink-
2024.02Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs (SMJ)arXivlink-
2024.02Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical ThinkingNAACL'24linklink
2024.01Low-Resource Languages Jailbreak GPT-4NeurIPS Workshop'24link-
2024.01How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (PAP)arXivlinklink
2023.12Tree of Attacks: Jailbreaking Black-Box LLMs Automatically (TAP)NeurIPS'24linklink
2023.12Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMsarXivlink-
2023.12Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking CompetitionACL'24link-
2023.11Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation (Persona)NeurIPS Workshop'23link-
2023.10Jailbreaking Black Box Large Language Models in Twenty Queries (PAIR)NeurIPS'24linklink
2023.10Adversarial Demonstration Attacks on Large Language Models (advICL)EMNLP'24link-
2023.10MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots (MASTERKEY)NDSS'24linklink
2023.10Attack Prompt Generation for Red Teaming and Defending Large Language Models (SAP)EMNLP'23linklink
2023.10An LLM can Fool Itself: A Prompt-Based Adversarial Attack (PromptAttack)ICLR'24linklink
2023.09Multi-step Jailbreaking Privacy Attacks on ChatGPT (MJP)EMNLP Findings'23linklink
2023.09Open Sesame! Universal Black Box Jailbreaking of Large Language Models (GA)Applied Sciences'24link-
2023.05Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt InjectionCCS'23linklink
2022.11Ignore Previous Prompt: Attack Techniques For Language Models (PromptInject)NeurIPS WorkShop'22linklink

White-box Attack

YearTitleVenuePaperCode
2025.08Don’t Say No: Jailbreaking LLM by Suppressing Refusal (DSN)ACL'25linklink
2025.03Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous ConstraintsarXivlinklink
2025.02Improved techniques for optimization-based jailbreaking on large language models (I-GCG)ICLR'25linklink
2024.12Efficient Adversarial Training in LLMs with Continuous AttacksNeurIPS'24linklink
2024.11AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer AttemptsarXivlink-
2024.11DROJ: A Prompt-Driven Attack against Large Language ModelsarXivlinklink
2024.11SQL Injection Jailbreak: a structural disaster of large language modelsarXivlinklink
2024.10Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak AttacksarXivlink-
2024.10AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention ManipulationarXivlinklink
2024.10Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weightingarXivlink-
2024.10Boosting Jailbreak Transferability for Large Language Models (SI-GCG)arXivlink-
2024.10Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities (ADV-LLM)arXivlinklink
2024.08Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation (JVD)arXivlink-
2024.08Jailbreak Open-Sourced Large Language Models via Enforced Decoding (EnDec)ACL'24link-
2024.07Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference DataCOLM'24Link-
2024.07Refusal in Language Models Is Mediated by a Single DirectionarXivLinkLink
2024.07Revisiting Character-level Adversarial Attacks for Language ModelsICML'24linklink
2024.07Badllama 3: removing safety finetuning from Llama 3 in minutes (Badllama 3)arXivlink-
2024.07SOS! Soft Prompt Attack Against Open-Source Large Language ModelsarXivlink-
2024.06COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability (COLD-Attack)ICML'24linklink
2024.05Semantic-guided Prompt Organization for Universal Goal Hijacking against LLMsarXivlink
2024.05Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained OptimizationNeurIPS'24Link-
2024.05AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (AutoDAN)ICLR'24linklink
2024.05AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs (AmpleGCG)arXivlinklink
2024.05Boosting jailbreak attack with momentum (MAC)ICLR Workshop'24linklink
2024.04AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs (AdvPrompter)arXivlinklink
2024.03Universal Jailbreak Backdoors from Poisoned Human FeedbackICLR'24link-
2024.02Attacking large language models with projected gradient descent (PGD)arXivlink-
2024.02Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering (JRE)arXivlink-
2024.02Curiosity-driven red-teaming for large language models (CRT)arXivlinklink
2023.12AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models (AutoDAN)arXivlinklink
2023.10Catastrophic jailbreak of open-source llms via exploiting generationICLR'24linklink
2023.06Automatically Auditing Large Language Models via Discrete Optimization (ARCA)ICML'23linklink
2023.07Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG)arXivlinklink

Multi-turn Attack

TimeTitleVenuePaperCode
2025.04Multi-Turn Jailbreaking Large Language Models via Attention ShiftingAAAI'25link-
2025.04X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-AgentsarXivlinklink
2025.04Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level LearningarXivlink-
2025.03Foot-In-The-Door: A Multi-turn Jailbreak for LLMsarXivlinklink
2025.03Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree SearcharXivlink-
2024.11MRJ-Agent: An Effective Jailbreak Agent for Multi-Round DialoguearXivlink-
2024.10Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models (JSP)arXivlinklink
2024.10Multi-round jailbreak attack on large languagearXivlink-
2024.10Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered CluesarXivlinklink
2024.10Automated Red Teaming with GOAT: the Generative Offensive Agent TesterarXivlink-
2024.09LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks YetarXivlinklink
2024.09RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn JailbreakingarXivlinklink
2024.08FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench (Automated Multi-shot Jailbreaks)arXivlink-
2024.08Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak AttacksarXivlinklink
2024.05CoA: Context-Aware based Chain of Attack for Multi-Turn Dialogue LLM (CoA)arXivlinklink
2024.04Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack (Crescendo)Microsoft Azurelink-

Attack on RAG-based LLM

TimeTitleVenuePaperCode
2024.09Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using JailbreakingarXivlinklink
2024.02Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning (Pandora)arXivlink-

Multi-modal Attack

TimeTitleVenuePaperCode
2024.11Jailbreak Attacks and Defenses against Multimodal Generative Models: A SurveyarXivlinklink
2024.10Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by SteparXivlink-
2024.10ColJailBreak: Collaborative Generation and Editing for Jailbreaking Text-to-Image Deep GenerationNeurIPS'24Link-
2024.08Jailbreaking Text-to-Image Models with LLM-Based Agents (Atlas)arXivlink-
2024.07Image-to-Text Logic Jailbreak: Your Imagination can Help You Do AnythingarXivlink-
2024.06Jailbreak Vision Language Models via Bi-Modal Adversarial PromptarXivlinklink
2024.05Voice Jailbreak Attacks Against GPT-4oarXivlinklink
2024.05Automatic Jailbreaking of the Text-to-Image Generative AI SystemsICML'24 Workshoplinklink
2024.04Image hijacks: Adversarial images can control generative models at runtimearXivlinklink
2024.03An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models (CroPA)ICLR'24linklink
2024.03Jailbreak in pieces: Compositional adversarial attacks on multi-modal language modelICLR'24link-
2024.03Rethinking model ensemble in transfer-based adversarial attacksICLR'24linklink
2024.02VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained ModelsNeurIPS'23linklink
2024.02Jailbreaking Attack against Multimodal Large Language ModelarXivlink-
2024.01Jailbreaking GPT-4V via Self-Adversarial Attacks with System PromptsarXivlink-
2024.03Visual Adversarial Examples Jailbreak Aligned Large Language ModelsAAAI'24link-
2023.12OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization (OT-Attack)arXivlink-
2023.12FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts (FigStep)arXivlinklink
2023.11SneakyPrompt: Jailbreaking Text-to-image Generative ModelsS&P'24linklink
2023.11On Evaluating Adversarial Robustness of Large Vision-Language ModelsNeurIPS'23linklink
2023.10How Robust is Google's Bard to Adversarial Image Attacks?arXivlinklink
2023.08AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning (AdvCLIP)ACM MM'23linklink
2023.07Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models (SGA)ICCV'23linklink
2023.07On the Adversarial Robustness of Multi-Modal Foundation ModelsICCV Workshop'23link-
2022.10Towards Adversarial Attack on Vision-Language Pre-training ModelsarXivlinklink

Jailbreak Defense

Learning-based Defense

TimeTitleVenuePaperCode
2025.12Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive ScoringarXiv'25linklink
2025.04JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language ModelCOLM'25linklink
2024.12Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language ModelsarXiv'24link-
2024.10Safety-Aware Fine-Tuning of Large Language ModelsarXiv'24link-
2024.10MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt AttacksAAAI'24link-
2024.08BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger (BaThe)arXivlink-
2024.07DART: Deep Adversarial Automated Red Teaming for LLM SafetyarXivlink-
2024.07Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge (Eraser)arXivlinklink
2024.07Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak AttacksarXivlinklink
2024.06Adversarial Tuning: Defending Against Jailbreak Attacks for LLMsarXivLink-
2024.06Jatmo: Prompt Injection Defense by Task-Specific Finetuning (Jatmo)arXivlinklink
2024.06Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization (SafeDecoding)ACL'24linklink
2024.06Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety AlignmentNeurIPS'24linklink
2024.06On Prompt-Driven Safeguarding for Large Language Models (DRO)ICML'24linklink
2024.06Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks (RPO)NeurIPS'24link-
2024.06Fight Back Against Jailbreaking via Prompt Adversarial Tuning (PAT)NeurIPS'24linklink
2024.05Towards Comprehensive and Efficient Post Safety Alignment of Large Language Models via Safety Patching (SAFEPATCHING)arXivlink-
2024.05Detoxifying Large Language Models via Knowledge Editing (DINM)ACL'24linklink
2024.05Defending Large Language Models Against Jailbreak Attacks via Layer-specific EditingarXivlinklink
2023.11MART: Improving LLM Safety with Multi-round Automatic Red-Teaming (MART)ACL'24link-
2023.11Baseline defenses for adversarial attacks against aligned language modelsarXivlink-
2023.10Safe rlhf: Safe reinforcement learning from human feedbackarXivlinklink
2023.08Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (RED-INSTRUCT)arXivlinklink
2022.04Training a Helpful and Harmless Assistant with Reinforcement Learning from Human FeedbackAnthropiclink-

Strategy-based Defense

TimeTitleVenuePaperCode
2025.12Compressed but Compromised? A Study of Jailbreaking in Compressed LLMsNeurIPS-WlinkBlogPost Link
2025.09LLM Jailbreak Detection for (Almost) Free!arXivlinklink
2025.05Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from JailbreakingarXivlinklink
2024.11Rapid Response: Mitigating LLM Jailbreaks with a Few ExamplesarXivlinklink
2024.10RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process (RePD)arXivlink-
2024.10Guide for Defense (G4D): Dynamic Guidance for Robust and Balanced Defense in Large Language Models (G4D)arXivlinklink
2024.10Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language ModelsarXivlink-
2024.09HSF: Defending against Jailbreak Attacks with Hidden State FilteringarXivlinklink
2024.08EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models (EEG-Defender)arXivlink-
2024.08Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks (PG)arXivlinklink
2024.08Self-Evaluation as a Defense Against Adversarial Attacks on LLMs (Self-Evaluation)arXivlinklink
2024.06Defending LLMs against Jailbreaking Attacks via Backtranslation (Backtranslation)ACL Findings'24linklink
2024.06SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding (SafeDecoding)ACL'24linklink
2024.06Defending Against Alignment-Breaking Attacks via Robustly Aligned LLMACL'24link-
2024.06A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily (ReNeLLM)NAACL'24linklink
2024.06SMOOTHLLM: Defending Large Language Models Against Jailbreaking AttacksarXivlinklink
2024.05Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting (Dual-critique)ACL'24linklink
2024.05PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition (PARDEN)ICML'24linklink
2024.05LLM Self Defense: By Self Examination, LLMs Know They Are Being TrickedICLR Tiny Paper'24linklink
2024.05GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis (GradSafe)ACL'24linklink
2024.05Multilingual Jailbreak Challenges in Large Language ModelsICLR'24linklink
2024.05Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss LandscapesNeurIPS'24link-
2024.05AutoDefense: Multi-Agent LLM Defense against Jailbreak AttacksarXivlinklink
2024.05Bergeron: Combating adversarial attacks through a conscience-based alignment framework (Bergeron)arXivlinklink
2024.05Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (ICD)arXivlink-
2024.04Protecting your llms with information bottleneckNeurIPS'24linklink
2024.04Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-TuningarXivlinklink
2024.02Certifying LLM Safety against Adversarial PromptingarXivlinklink
2024.02Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-RefinementarXivlink-
2024.02Defending large language models against jailbreak attacks via semantic smoothing (SEMANTICSMOOTH)arXivlinklink
2024.01Intention Analysis Makes LLMs A Good Jailbreak Defender (IA)arXivlinklink
2024.01How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (PAP)ACL'24linklink
2023.12Defending ChatGPT against jailbreak attack via self-reminders (Self-Reminder)Nature Machine Intelligencelinklink
2023.11Detecting language model attacks with perplexityarXivlink-
2023.10RAIN: Your Language Models Can Align Themselves without Finetuning (RAIN)ICLR'24linklink

Guard Model

TimeTitleVenuePaperCode
2026.02GuardReasoner-Omni: A Reasoning-based Multi-modal Guardrail for Text, Image, and VideoarXiv'26linklink
2025.12OmniGuard: Unified Omni-Modal Guardrails with Deliberate ReasoningarXiv'25link-
2025.10Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection (PSR)EMNLP'25linklink
2025.05GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning (GuardReasoner-VL)NeurIPS'25linklink
2025.04X-Guard: Multilingual Guard Agent for Content Moderation (X-Guard)arXiv'25linklink
2025.02ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails (ThinkGuard)arXiv'25linklink
2025.02Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red TeamingarXiv'25link-
2025.01GuardReasoner: Towards Reasoning-based LLM Safeguards (GuardReasoner)ICLR Workshop'25linklink
2024.12Lightweight Safety Classification Using Pruned Language Models (Sentence-BERT)arXiv'24link-
2024.11GuardFormer: Guardrail Instruction Pretraining for Efficient SafeGuarding (GuardFormer)Metalink-
2024.11Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations (LLaMA Guard 3 Vision)Metalinklink
2024.11AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails (Aegis2.0)Nvidia, NeurIPS'24 Workshoplink-
2024.11Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings (Sentence-BERT)arXiv'24link-
2024.11STAND-Guard: A Small Task-Adaptive Content Moderation Model (STAND-Guard)Microsoftlink-
2024.10VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled DataarXivlink-
2024.09AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts (Aegis)Nvidialinklink
2024.09Llama 3.2: Revolutionizing edge AI and vision with open, customizable models (LLaMA Guard 3)Metalinklink
2024.08ShieldGemma: Generative AI Content Moderation Based on Gemma (ShieldGemma)Googlelinklink
2024.07WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs (WildGuard)NeurIPS'24linklink
2024.06GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning (GuardAgent)arXiv'24link-
2024.06R2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning (R2-Guard)arXivlinklink
2024.04Llama Guard 2Metalinklink
2024.03AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting (AdaShield)ECCV'24linklink
2023.12Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations (LLaMA Guard)Metalinklink

Moderation API

TimeTitleVenuePaperCode
2023.08Using GPT-4 for content moderation (GPT-4)OpenAIlink-
2023.02A Holistic Approach to Undesired Content Detection in the Real World (OpenAI Moderation Endpoint)AAAI OpenAIlinklink
2022.02A New Generation of Perspective API: Efficient Multilingual Character-level Transformers (Perspective API)KDD Googlelinklink
-Azure AI Content SafetyMicrosoft Azure-link
-Detoxifyunitary.ai-link
-promptfoo - LLM red teaming framework with adaptive multi-turn attacks (PAIR, tree-of-attacks, crescendo)promptfoo-link

Evaluation & Analysis

TimeTitleVenuePaperCode
2026.02AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM SystemsarXivlinklink
2025.12Compressed but Compromised? A Study of Jailbreaking in Compressed LLMsNeurIPS-WlinkBlogPost Link
2025.08JADES: A Universal Framework for Jailbreak Assessment via Decompositional ScoringarXivlinklink
2025.06Activation Approximations Can Incur Safety Vulnerabilities Even in Aligned LLMs: Comprehensive Analysis and DefenseUSENIX Security'25linklink
2025.05Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark StudyEMNLP'25linklink
2025.05PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking AttacksarXivlinklink
2025.05Assessing Safety Risks and Quantization-aware Safety Patching for Quantized Large Language ModelsICML'25linklink
2025.02GuidedBench: Equipping Jailbreak Evaluation with GuidelinesarXivlinklink
2024.12Agent-SafetyBench: Evaluating the Safety of LLM AgentsarXivlinklink
2024.11Global Challenge for Safe and Secure LLMs Track 1arXivlink-
2024.11JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and CircuitarXivlink-
2024.11The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and DefensearXivlink-
2024.11HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model AlignmentarXivlink-
2024.11ChemSafetyBench: Benchmarking LLM Safety on Chemistry DomainarXivlinklink
2024.11GuardBench: A Large-Scale Benchmark for Guardrail ModelsEMNLP'24linklink
2024.11What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind AttacksarXivLinklink
2024.11Benchmarking LLM Guardrails in Handling Multilingual ToxicityarXivlinklink
2024.10JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation FrameworkarXivlinklink
2024.10Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI SystemsarXivlinklink
2024.10A Realistic Threat Model for Large Language Model JailbreaksarXivlinklink
2024.10ADVERSARIAL SUFFIXES MAY BE FEATURES TOO!arXivlinklink
2024.09JAILJUDGE: A COMPREHENSIVE JAILBREAKarXivLinkLink
2024.09Multimodal Pragmatic Jailbreak on Text-to-image ModelsarXivlinklink
2024.08ShieldGemma: Generative AI Content Moderation Based on Gemma (ShieldGemma)arXivlinklink
2024.08MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models (MMJ-Bench)arXivlinklink
2024.08Mission Impossible: A Statistical Perspective on Jailbreaking LLMsNeurIPS'24Link-
2024.07Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)arXivlinklink
2024.07JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak AttacksarXivlinklink
2024.07Jailbreak Attacks and Defenses Against Large Language Models: A SurveyarXivlink-
2024.06"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' JailbreakarXivlinklink
2024.06WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models (WildTeaming)NeurIPS'24linklink
2024.06From LLMs to MLLMs: Exploring the Landscape of Multimodal JailbreakingarXivlink-
2024.06AI Agents Under Threat: A Survey of Key Security Challenges and Future PathwaysarXivlink-
2024.06MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models (MM-SafetyBench)arXivlink-
2024.06ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (VITC)ACL'24linklink
2024.06Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMsNeurIPS'24linklink
2024.06JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models (JailbreakZoo)arXivlinklink
2024.06Fundamental limitations of alignment in large language modelsarXivlink-
2024.06JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models (JailbreakBench)NeurIPS'24linklink
2024.06Towards Understanding Jailbreak Attacks in LLMs: A Representation Space AnalysisarXivlinklink
2024.06JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models (JailbreakEval)arXivlinklink
2024.05Rethinking How to Evaluate Language Model JailbreakarXivlinklink
2024.05Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting (INDust)arXivlinklink
2024.05Prompt Injection attack against LLM-integrated ApplicationsarXivlink-
2024.05Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting JailbreaksLREC-COLING'24linklink
2024.05LLM Jailbreak Attack versus Defense Techniques--A Comprehensive StudyNDSS'24link-
2024.05Jailbreaking ChatGPT via Prompt Engineering: An Empirical StudyarXivlink-
2024.05Detoxifying Large Language Models via Knowledge Editing (SafeEdit)ACL'24linklink
2024.04JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models (JailbreakLens)arXivlink-
2024.03How (un) ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries (TECHHAZARDQA)arXivlinklink
2024.03Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language ModelsUSENIX Securitylink-
2024.03EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models (EasyJailbreak)arXivlinklink
2024.02Comprehensive Assessment of Jailbreak Attacks Against LLMsarXivlink-
2024.02SPML: A DSL for Defending Language Models Against Prompt AttacksarXivlink-
2024.02Coercing LLMs to do and reveal (almost) anythingarXivlink-
2024.02A STRONGREJECT for Empty Jailbreaks (StrongREJECT)NeurIPS'24linklink
2024.02ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three StagesACL'24linklink
2024.02HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal (HarmBench)arXivlinklink
2023.12Goal-Oriented Prompt Attack and Safety Evaluation for LLMsarXivlinklink
2023.12The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-DefensivenessarXivlink-
2023.12A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language ModelsUbiSec'23link-
2023.11Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the WildarXivlink-
2023.11How many unicorns are in this image? a safety evaluation benchmark for vision llmsarXivlinklink
2023.11Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion PrinciplesarXivlink-
2023.10Explore, establish, exploit: Red teaming language models from scratcharXivlink-
2023.10Survey of Vulnerabilities in Large Language Models Revealed by Adversarial AttacksarXivlink-
2023.10Fine-tuning aligned language models compromises safety, even when users do not intend to! (HEx-PHI)ICLR'24 (oral)linklink
2023.08Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (RED-EVAL)arXivlinklink
2023.08Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and VulnerabilitiesarXivlink-
2023.07Jailbroken: How Does LLM Safety Training Fail? (Jailbroken)NeurIPS'23link-
2023.08Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and VulnerabilitiesarXivlink-
2023.08From chatgpt to threatgpt: Impact of generative ai in cybersecurity and privacyIEEE Accesslink-
2023.07Llm censorship: A machine learning challenge or a computer security problem?arXivlink-
2023.07Universal and Transferable Adversarial Attacks on Aligned Language Models (AdvBench)arXivlinklink
2023.06DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT ModelsNeurIPS'23linklink
2023.04Safety Assessment of Chinese Large Language ModelsarXivlinklink
2023.02Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security AttacksarXivlink-
2022.11Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons LearnedarXivlink-
2022.02Red Teaming Language Models with Language ModelsarXivlink
2026.03Evaluation and Alignment, The Seminal PapersManninglink-

Application

TimeTitleVenuePaperCode
2025.12Compressed but Compromised? A Study of Jailbreaking in Compressed LLMsNeurIPS-WlinkBlogPost Link
2025.08Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment FailuresarXivlinklink
2024.11Attacking Vision-Language Computer Agents via Pop-upsarXivlinklink
2024.10Jailbreaking LLM-Controlled Robots (ROBOPAIR)arXivlinklink
2024.10SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical SynthesisarXivlinklink
2024.10Cheating Automatic LLM Benchmarks: Null Models Achieve High Win RatesarXivlinklink
2024.09RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing SystemsarXivlink-
2024.08A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares (APwT)arXivlink-

Other Related Awesome Repository

Contributors

yueliu1999 bhooi zqypku jiaxiaojunQAQ Huang-yihao csyuhao xszheng2020 dapurv5 ZYQ-Zoey77 mdoumbouya xyliugo zky001

(back to top)