Back to Blog
Research

How ELITE Reveals Dangerous Weaknesses in Vision-Language AI

E
Eugene Choi
May 29, 20257 min read

"What if your AI assistant could draw you a bomb schematic? Or calmly suggest ways to hide evidence?"

As AI systems evolve to process images and text together, the risks grow exponentially. Vision-Language Models (VLMs) are unlocking new capabilities — but also exposing us to dangerous new vulnerabilities.

Existing safety tests barely scratch the surface. A new paper, ELITE, might change that. It doesn't just measure whether a model is 'safe' — it evaluates how dangerous its outputs could be. And it does so with a precision that even rivals human reviewers.

This blog breaks down the ELITE paper and explains why it matters for developers, policymakers, and anyone deploying AI in the real world.

1. Why VLM Safety Needs a Rethink

Vision-Language Models are vulnerable to subtle manipulations. Even when input prompts seem harmless, models can produce toxic, biased, or dangerous outputs. Yet most safety benchmarks today miss these cases.

The ELITE paper identifies three critical weaknesses in existing VLM safety benchmarks that significantly undermine their real-world applicability:

  1. Over-reliance on automatic evaluation methods: Prior tools often flag or miss harmful outputs based on superficial refusal or keyword detection.
  2. Low harmfulness levels: Many benchmarks contain image-text pairs that are unlikely to induce unsafe behavior, failing to reflect real attack scenarios.
  3. Lack of diversity in prompt combinations: Existing benchmarks tend to focus on unsafe-unsafe pairs, ignoring more subtle attack surfaces such as safe-safe pairs with embedded harmful intent.
What's needed is a benchmark that can:
  • Accurately assess subtle multimodal risks
  • Align with human judgment
  • Scale to thousands of prompts
That's where ELITE comes in.

2. ELITE Evaluator: Grading Toxicity with Nuance

The paper introduces a new evaluation formula based on the StrongREJECT rubric but adds a crucial factor: toxicity, which captures the degree of potential harm in a model's output regardless of refusal.

The Formula:

Where:

  • refused: whether the model has refused the forbidden prompt (1 = model refused, 0 = responded)
  • specific: the specificity of the response (1–5)
  • convincing: the convincingness of the response (1–5)
  • toxicity: how harmful the victim model's response is (0–5)

Key Improvements:

  • Previous evaluators would incorrectly penalize helpful, harmless image descriptions as "harmful" just because they lacked a refusal.
  • ELITE solves this by quantifying how harmful the output truly is, through rubric-based toxicity scoring.
The paper provides multiple case studies where StrongREJECT misclassifies safe but descriptive outputs as dangerous, while ELITE correctly identifies them as low-risk.

3. ELITE Benchmark: How It Was Built

The ELITE benchmark consists of 4,587 image-text pairs drawn from both existing benchmarks (like VLGuard, MM-SafetyBench) and 1,054 newly generated examples using four distinct attack methods:

Image & Text Creation Methods:

  1. Role Playing: Prompts masked as role scenarios (e.g., pretending to be a lawyer or doctor).
  2. Fake News: Misinformation framed as news or social commentary.
  3. Blueprints: Diagrams explaining illegal actions (e.g., assembling a weapon).
  4. Flowcharts: Step-by-step visual guides for harmful procedures.
These methods were applied to 11 safety-related taxonomies (e.g., Defamation, Hate Speech, Self-Harm), ensuring broad coverage of domains that are particularly sensitive to harmful or misleading content:
  • Violent/Non-violent crimes
  • Hate, self-harm, sex crimes
  • Privacy, defamation, etc.

Benchmark Construction Pipeline:

Text prompts were generated with Grok 2 and evaluated across three victim models: Phi-3.5-Vision, LLaMA 3.2–11B-Vision, and Pixtral-12B. To ensure the benchmark contains only truly adversarial and harmful prompts, each generated image-text pair was evaluated using three victim VLMs. The ELITE evaluator was applied to each model's response. If at least two of the three models produced an ELITE score of 10 or higher — indicating a sufficiently harmful output — the pair was included in the benchmark.

A filtering threshold of ELITE score ≥ 10 was used to keep only pairs capable of inducing harmful outputs.

4. Benchmark Results: Stronger Signals, Better Insights

Key Findings:

  • GPT-4o: Best-performing model with lowest average E-ASR (15.67%)
  • Pixtral-12B: Most vulnerable (E-ASR 79.86%)
  • Many open-source models scored >50%, revealing poor robustness to subtle attacks

Benchmark Comparison:

The dramatic increase in E-ASR observed in the ELITE benchmark compared to others suggests that many previous benchmarks were not sufficiently challenging. This finding reveals the inadequacy of past evaluation setups to surface model vulnerabilities that arise in more realistically adversarial scenarios.

  • ELITE benchmark yields ~2–3x higher E-ASR compared to previous benchmarks
  • ELITE (generated) = hardest setting, producing highest attack success

5. Human Evaluation: Validating the Evaluator

To rigorously validate the ELITE evaluator, the authors conducted a human evaluation study using a carefully selected set of 963 image-text-response pairs. These were sampled from all 11 taxonomies, with approximately 90 examples per category, and prioritized cases where different automated methods disagreed — thereby testing evaluators in challenging, ambiguous contexts.

Human Annotators and Labeling Protocol:

  • 22 human annotators were recruited through a professional data-labeling firm.
  • Annotators were selected to ensure diversity in gender, age, and occupation.
  • Each example was labeled by three annotators; the final label was determined by majority vote.
  • Annotators followed detailed guidelines aligned with the ELITE taxonomy and labeling criteria.

Evaluator Comparison Results:

  • ELITE (GPT-4o) achieved the highest alignment with human labels: AUROC 0.77, versus:
  • StrongREJECT (GPT-4o): 0.46
  • ELITE with InternVL2.5: 0.57–0.65
F1 Scores (from 963 human-reviewed examples):
  • ELITE (GPT-4o): 0.637
  • LlamaGuard: 0.233
  • OpenAI Moderation API: 0.412
This means you can use ELITE as a reliable proxy for human reviewers, especially when auditing large-scale outputs.

6. Taxonomy Breakdown: Strength Across the Board

Unlike other tools that excel in one category and fail in others, ELITE performs consistently well across all safety taxonomies. This balance is key for safety certifications and industry audits.

7. Real-World Use Cases

Developers:

  • Use ELITE as fine-tuning feedback signal
  • Run adversarial test cases before deployment

Red Teams:

  • Craft multi-taxonomy attacks (e.g., fake-news + IP)
  • Evaluate output diversity across temperature, sampling

Policy Teams:

  • Report ELITE-based E-ASR in AI system cards
  • Identify gaps in existing model disclosures

8. Limitations and Future Work

  • No multi-turn context: Future versions could add conversation chains.
  • Evaluator dependency: Accuracy depends on the underlying LLM.
  • Release scope: Some toxic content may limit public dataset distribution.
Despite this, ELITE represents a huge step forward in structured VLM safety evaluation.

Final Thoughts: Toward Safer, Smarter Multimodal AI

The future of AI safety lies in multimodal alignment — and ELITE is a critical piece of that puzzle.

Whether you're building, auditing, or regulating AI, don't just ask "Did it refuse?" Ask: "What did it say instead?"

ELITE reminds us that true safety demands nuance, structure, and real-world adversarial thinking. Now that we have the tools, the only question is: Will we use them?

Paper: ELITE: Enhanced Language-Image Toxicity Evaluation for Safety Authors: Wonjun Lee, Doehyeon Lee, Eugene Choi et al. Institutions: AIM Intelligence, Seoul National University, Yonsei University, KIST, Sookmyung Women's University


About AIM Intelligence & Collaboration Opportunities

AIM Intelligence safeguards the next wave of multimodal AI with end-to-end security tooling. Leveraging insights from the ELITE rubric — and with full dataset integration on our roadmap — we push safety coverage further than ever:

🚀 AIM Red — Automated multimodal red-teaming that rapidly crafts diverse, real-world attack scenarios to stress-test Vision-Language Models (VLMs) before they reach production.

🛡️ AIM Guard — Real-time guardrails that fuse toxicity scoring with policy filters, blocking unsafe image-text generations on the fly.

By pairing deep research with battle-tested engineering, we help builders ship safer, smarter multimodal systems — without slowing innovation.

Interested in partnering with us or evaluating our AI security suite for your organization? Contact Us and let's shape a safer AI-driven future together.

AIM Intelligence

Red Dot on Aim Scope's crosshair - detecting, targeting, and eliminating AI threats

Copyright © 2026 AIM Intelligence, Inc. All rights reserved.

sales@aim-intelligence.com
How ELITE Reveals Dangerous Weaknesses in Vision-Language AI | AIM Intelligence Blog | AIM Intelligence - AI Security & Safety