Tool-Mediated Belief Injection: How Tool Outputs Can Cascade Into Model Misalignment
When we deploy language models with access to external tools, we dramatically expand their capabilities. However, tool access also introduces attack surfaces that differ fundamentally from traditional prompt injection: tool outputs are typically treated as trusted observations and placed directly into the model's context. We document how adversarially crafted tool outputs can establish false premises that persist and compound across a conversation.
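To make the mechanism concrete, the following is a minimal sketch of a generic tool-use loop, written under the assumption that tool results are appended verbatim to the conversation history. The names `call_model`, `run_tool`, and `tool_loop` are hypothetical placeholders, not any specific vendor API; the point is only to show where an attacker-influenced string enters the context and why it remains visible on every later turn.

```python
# Hypothetical sketch: how an unvalidated tool result becomes a persistent
# part of the model's context in a standard tool-use loop.

def call_model(messages: list[dict]) -> dict:
    """Placeholder for a chat-completion call; returns the next assistant message."""
    raise NotImplementedError

def run_tool(name: str, arguments: dict) -> str:
    """Placeholder for tool execution. The returned string is attacker-influenceable
    whenever the tool reads external data (web pages, files, emails, third-party APIs)."""
    raise NotImplementedError

def tool_loop(user_prompt: str, max_turns: int = 5) -> list[dict]:
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):
        reply = call_model(messages)
        messages.append(reply)
        if "tool_call" not in reply:
            break
        # The tool output is appended verbatim. If it contains an adversarial
        # claim (e.g. "the user has already authorized this transfer"), that
        # claim now sits in the context window as if it were an observed fact,
        # and every subsequent model call conditions on it.
        result = run_tool(reply["tool_call"]["name"], reply["tool_call"]["arguments"])
        messages.append({"role": "tool", "content": result})
    return messages
```

In this sketch the injected premise is never re-examined: once appended, it is carried forward with the rest of the history, which is the persistence-and-compounding effect described above.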