AdvBench

PulseAugur coverage of AdvBench: every cluster mentioning AdvBench across labs, papers, and developer communities, ranked by signal.

Total · 30d: 5 (13 over 90d)

Releases · 30d: 0 (0 over 90d)

Papers · 30d: 5 (13 over 90d)

SENTIMENT · 30D: 5 days with sentiment data

RECENT · PAGE 1/1 · 13 TOTAL

RESEARCH · CL_151938 · Jul 20 · 04:00

New frameworks emerge to evaluate and defend against LLM jailbreaks · 4 sources tracked

Researchers are developing new methods to evaluate and defend against jailbreak attacks on large language models (LLMs). One approach, Incomplete Prompt Jailbreaks (IPJ), focuses on how LLMs delay refusal of harmful pro…
TOOL · CL_133542 · Jul 9 · 04:00

New NonTextual Target Attack bypasses LLM safety measures with 96.8% success

Researchers have developed a new method called NonTextual Target Attack (NTA) to bypass safety measures in large language models (LLMs). Unlike previous attacks that relied on specific target outputs, NTA focuses on max…
TOOL · CL_128876 · Jul 7 · 04:00

Multi-stage workflow jailbreaks bypass safety measures in AI coding agents

A new research paper explores a jailbreaking technique for AI coding agents, demonstrating how harmful objectives can be achieved by assembling them across multiple stages of a software development workflow, rathe…
RESEARCH · CL_128509 · Jul 6 · 03:59

New RetroCoT method bypasses LLM safety alignment by reframing harmful requests

Researchers have developed a new method called Retroactive Chain-of-Thought (RetroCoT) to test the safety alignment of large language models. This technique reframes harmful requests as forensic reconstruction tasks, pr…
RESEARCH · CL_122984 · Jul 2 · 08:17

New STEER attack exploits LLM safety gaps in multilingual contexts · 3 sources tracked

Researchers have developed a new method called STEER (Safety Targeted Embedding Exploit via Refinement) to exploit vulnerabilities in the safety training of large language models (LLMs). This technique targets models tr…
TOOL · CL_74402 · Jun 6 · 04:00

Researchers automate security rule generation from attack simulations

Researchers have developed a method to automatically generate security detection rules from attack simulations. This system deterministically maps findings from Breach-and-Attack-Simulation (BAS) tools to starter Sigma …
RESEARCH · CL_70412 · Jun 3 · 08:49

Hybrid defense framework boosts LLM accuracy and robustness

Researchers have developed a novel hybrid defense framework to combat both hallucinations and adversarial manipulation in large language models. This approach integrates entropy-based methods for reducing hallucinations…
RESEARCH · CL_62284 · May 29 · 10:49

EvoDefense uses LLMs to co-evolve defenses against black-box attacks

Researchers have developed EvoDefense, a novel approach to protect large language models (LLMs) from attacks in black-box scenarios. This system uses a guard LLM and an experience memory to continuously refine defense s…
RESEARCH · CL_58559 · May 28 · 14:53

New research reveals escalating LLM and LALM jailbreak vulnerabilities

Three new research papers explore the vulnerabilities and defenses of large language models (LLMs) and large audio-language models (LALMs). The first paper details a taxonomy of audio jailbreak attacks and defenses, hig…
TOOL · CL_53861 · May 27 · 04:00

New Research: Open-Weight LLM Defenses Vulnerable to Simple Jailbreaks

A new paper published on arXiv demonstrates that current defenses designed to protect open-weight large language models (LLMs) from harmful usage are susceptible to simple jailbreaking techniques. Researchers found that…
RESEARCH · CL_53580 · May 26 · 14:51

New BAIT Framework Exploits LLM Reasoning for Jailbreaking

Researchers have developed a new three-step framework called BAIT (Boundary-Aware Iterative Trap) designed to escalate disclosure of malicious content from large language models. This method guides models through identi…
TOOL · CL_15984 · May 5 · 04:00

New Logit-Gap Steering method efficiently measures AI alignment robustness

Researchers have developed a new metric called the refusal-affirmation logit gap to quantify the safety margin of aligned language models. This metric, which measures the difference between refusal and affirmation token…
RESEARCH · CL_11458 · Apr 30 · 04:13

New diagnostic tool probes LLM circuits for safety and behavior insights

A new research paper introduces "Perturbation Probing," a diagnostic method for understanding the internal workings of large language models. This technique uses two forward passes per prompt to identify and analyze "be…