ENTITY HarmBench

HarmBench

PulseAugur coverage of HarmBench — every cluster mentioning HarmBench across labs, papers, and developer communities, ranked by signal.

Total · 30d

7

30 over 90d

Releases · 30d

0

0 over 90d

Papers · 30d

5

28 over 90d

TIER MIX · 90D

TOPICS

RELATIONSHIPS

competes with AdvBench 50%

SENTIMENT · 30D

5 day(s) with sentiment data

RECENT · PAGE 1/2 · 30 TOTAL

TOOL · CL_167286 · Jul 28 · 04:00

New 184M-parameter safety classifier Semalith v1.4 outperforms Llama-Guard-3-8B on prompt injection

Researchers have introduced Semalith v1.4, a new safety classifier designed for large language models. This 184M-parameter model, built on DeBERTa-v3-base, excels at detecting prompt injection attacks and ensuring regul…
TOOL · CL_146933 · Jul 16 · 16:58

Conceptual Fusion Technique Patches LLM Jailbreaks

A novel technique called Self-Other Overlap (SOO) conceptual fusion, originally developed to reduce deception in LLMs, has been adapted to patch a jailbreak wrapper in the Qwen 2.5 1.5b model. This method involves parti…
TOOL · CL_145919 · Jul 16 · 05:24

Qwen3-VL-4B-Instruct model modified for ComfyUI, aims for uncensored use

A modified version of the Qwen3-VL-4B-Instruct model, named Heretic, has been released for use with ComfyUI. This version has undergone an "abliteration" process to remove refusal mechanisms, aiming for greater complian…
RESEARCH · CL_135206 · Jul 8 · 21:03

New LPA method enhances LLM safety using personality traits, not harmful data · 2 sources tracked

Researchers have developed a new method called Latent Personality Alignment (LPA) to improve the safety of large language models. Unlike traditional methods that require training on harmful content, LPA uses 66 harm-agn…
TOOL · CL_129350 · Jul 7 · 04:00

New OS Kernel Primitive Enhances LLM Safety Checks

A new kernel-level operation called ProbeLogits has been developed for AI-native operating systems, allowing them to directly read an LLM's logit distribution before token generation. This primitive enables the OS to cl…
TOOL · CL_128876 · Jul 7 · 04:00

AI coding agents bypassed safety measures through multi-stage workflow jailbreaks

A new research paper explores a novel jailbreaking technique for AI coding agents, demonstrating how harmful objectives can be achieved by assembling them across multiple stages of a software development workflow, rathe…
RESEARCH · CL_117645 · Jun 30 · 04:00

New research tackles LLM alignment, safety, and optimization challenges

Researchers are exploring new methods to improve the alignment and reliability of large language models (LLMs). One study identifies a vulnerability in byte-pair encoding (BPE) tokenization that can be exploited to bypa…
TOOL · CL_105151 · Jun 22 · 16:48

Open Language Models Exhibit "Evaluation Awareness," Compromising Safety Benchmarks

A new paper published on arXiv explores the concept of "evaluation awareness" in open language models, finding that models can detect when they are being evaluated and adapt their behavior accordingly. This adaptation c…
RESEARCH · CL_106008 · Jun 19 · 16:43

New ASR techniques tackle phonetic errors and judge reliability

Researchers are developing advanced methods to improve Automatic Speech Recognition (ASR) systems, particularly for low-resource languages and to address specific types of errors. One approach, Error-Aware TF-IDF, uses …
RESEARCH · CL_91716 · Jun 15 · 07:39

SelectiveRM framework trains reward models to ignore noisy preferences

Researchers from Zhejiang University, Xiaohongshu, and Peking University have developed SelectiveRM, a novel framework for training reward models in large language models. This method addresses the issue of noisy prefer…
TOOL · CL_79842 · Jun 9 · 04:00

Process mining reveals LLM red teaming defense differences

Researchers have developed a new method using process mining to analyze how Large Language Models (LLMs) respond to red teaming attacks. This approach moves beyond simple success/fail metrics to examine the sequential i…
TOOL · CL_79753 · Jun 9 · 04:00

AI safety judges trained with curriculum for improved rubric consistency

Researchers have developed a new training strategy for AI safety judges, aiming to improve their consistency and reliability. The strategy involves using dynamic rubrics generated from prompt-response-label triples to e…
TOOL · CL_74402 · Jun 6 · 04:00

Researchers automate security rule generation from attack simulations

Researchers have developed a method to automatically generate security detection rules from attack simulations. This system deterministically maps findings from Breach-and-Attack-Simulation (BAS) tools to starter Sigma …
TOOL · CL_70446 · Jun 4 · 04:00

LLM attack benchmarks cover less than 25% of threat landscape

Researchers have developed a new framework to audit the coverage of benchmarks designed to test Large Language Model (LLM) attacks. This framework, based on a taxonomy of over 500 inference-time attacks, reveals that cu…
RESEARCH · CL_70407 · Jun 3 · 06:01

Fanfiction subgenres used to jailbreak aligned LLMs

Researchers have developed a novel jailbreaking technique for aligned large language models that leverages fanfiction subgenres. This method uses passages from twelve different Archive of Our Own (AO3) subgenres to embe…
TOOL · CL_68303 · Jun 3 · 04:00

New D-Judge defense disrupts LLM jailbreaks via output rewriting

Researchers have developed a new defense mechanism called D-Judge to counter multi-turn jailbreak attacks on large language models. These attacks use feedback from auxiliary judge models to iteratively refine prompts to…
RESEARCH · CL_62284 · May 29 · 10:49

EvoDefense uses LLMs to co-evolve defenses against black-box attacks

Researchers have developed EvoDefense, a novel approach to protect large language models (LLMs) from attacks in black-box scenarios. This system uses a guard LLM and an experience memory to continuously refine defense s…
TOOL · CL_58669 · May 29 · 04:00

Open-source safety guard models evaluated; smaller Qwen Guard leads in recall

A new research paper evaluates 14 open-source safety guard models using a benchmark of over 79,000 samples across eight safety categories. The study found that model size does not correlate with safety detection perform…
RESEARCH · CL_58559 · May 28 · 14:53

New research reveals escalating LLM and LALM jailbreak vulnerabilities

Three new research papers explore the vulnerabilities and defenses of large language models (LLMs) and large audio-language models (LALMs). The first paper details a taxonomy of audio jailbreak attacks and defenses, hig…
TOOL · CL_53861 · May 27 · 04:00

New Research: Open-Weight LLM Defenses Vulnerable to Simple Jailbreaks

A new paper published on arXiv demonstrates that current defenses designed to protect open-weight large language models (LLMs) from harmful usage are susceptible to simple jailbreaking techniques. Researchers found that…