HarmBench
PulseAugur coverage of HarmBench — every cluster mentioning HarmBench across labs, papers, and developer communities, ranked by signal.
-
Process mining reveals LLM red teaming defense differences
Researchers have developed a new method using process mining to analyze how Large Language Models (LLMs) respond to red teaming attacks. This approach moves beyond simple success/fail metrics to examine the sequential i…
-
AI safety judges trained with curriculum for improved rubric consistency
Researchers have developed a new training strategy for AI safety judges, aiming to improve their consistency and reliability. The strategy involves using dynamic rubrics generated from prompt-response-label triples to e…
-
Researchers automate security rule generation from attack simulations
Researchers have developed a method to automatically generate security detection rules from attack simulations. This system deterministically maps findings from Breach-and-Attack-Simulation (BAS) tools to starter Sigma …
-
LLM attack benchmarks cover less than 25% of threat landscape
Researchers have developed a new framework to audit the coverage of benchmarks designed to test Large Language Model (LLM) attacks. This framework, based on a taxonomy of over 500 inference-time attacks, reveals that cu…
-
Fanfiction subgenres used to jailbreak aligned LLMs
Researchers have developed a novel jailbreaking technique for aligned large language models that leverages fanfiction subgenres. This method uses passages from twelve different Archive of Our Own (AO3) subgenres to embe…
-
New D-Judge defense disrupts LLM jailbreaks via output rewriting
Researchers have developed a new defense mechanism called D-Judge to counter multi-turn jailbreak attacks on large language models. These attacks use feedback from auxiliary judge models to iteratively refine prompts to…
-
EvoDefense uses LLMs to co-evolve defenses against black-box attacks
Researchers have developed EvoDefense, a novel approach to protect large language models (LLMs) from attacks in black-box scenarios. This system uses a guard LLM and an experience memory to continuously refine defense s…
-
LLM attack benchmarks show significant gaps in security coverage
Researchers have developed a new framework to audit the coverage of LLM attack benchmarks, revealing significant gaps in current evaluations. Their analysis of six public benchmarks showed they collectively cover less t…
-
New Logit-Gap Steering method efficiently measures AI alignment robustness
Researchers have developed a new metric called the refusal-affirmation logit gap to quantify the safety margin of aligned language models. This metric, which measures the difference between refusal and affirmation token…
-
CorrSteer method enhances LLM steering using correlated sparse autoencoder features
Researchers have developed CorrSteer, a novel method for steering large language models (LLMs) during generation using features extracted from Sparse Autoencoders (SAEs). This technique correlates sample correctness wit…
-
New research tackles LLM jailbreaks with dynamic evaluation and robust defense strategies
Multiple research papers explore advanced techniques for enhancing the safety and robustness of large language models (LLMs) against jailbreak attacks. These studies introduce novel frameworks and methods for evaluating…
-
New attack redirects LLM attention to bypass safety alignment
Researchers have developed a new white-box adversarial attack called the Attention Redistribution Attack (ARA) that targets the internal attention mechanisms of safety-aligned large language models. This attack crafts n…
-
New red-teaming method ContextualJailbreak bypasses LLM safety alignment
Researchers have developed ContextualJailbreak, an evolutionary red-teaming strategy designed to find vulnerabilities in large language models. This black-box approach uses simulated multi-turn dialogues and a graded ha…
-
New tool AgentSeer reveals critical gaps in LLM agentic security
Researchers have developed a new tool called AgentSeer to evaluate the vulnerabilities of large language models (LLMs) when they are deployed in agentic systems. This tool decomposes agent executions into action-compone…
-
LLM safety benchmarks show high sensitivity to judge configuration choices
A new research paper highlights significant variability in AI safety benchmark results due to judge configuration choices. The study found that altering prompt wording alone, while keeping the judge model constant, coul…