Evaluating Text-based VLM Attack Methods: In-depth Look at Figstep

To evaluate VLM Safety, it is essential to develop a secure model that incorporates the unique characteristics of VLMs. When constructing a VLM Safety benchmark based on existing LLM benchmarks, it is crucial to take into account the differences between LLMs and VLMs.

The first key aspect to consider is the data input method. In LLMs, text is input sequentially as tokens, allowing the model to grasp context and infer meaning. In contrast, VLMs handle multimodal data by simultaneously receiving both image and text inputs. The image data is processed through a dedicated encoder, transforming it into pixel-level information or vectorized representations of visual features before being passed to the model.

Sample of Figstep Dataset — Sample of datasets of Figstep from GitHub

From this perspective, an intriguing case is the Figstep dataset. This dataset is constructed in a very simple way, where text is overlaid on a white background to create images, without considering complex VLM factors such as black-box or gradient methods. Despite this simplicity, the results presented in the paper introducing Figstep indicate an average Attack Success Rate (ASR) of over 80% across typical VLMs.

Various Datasets will be checked in this article

Experiment 1: Figstep Dataset

First, let's analyze the Figstep dataset. Figstep's image input format involves a harmful instruction with numbers 1, 2, and 3 attached, requiring a list or stepwise explanation. This harmful instruction is a simplified version of harmful instructions found in existing LLM benchmarks like Safebench.

Next, we examined whether the latest VLMs can effectively defend against this type of jailbreak attack. We selected GPT-4o as the representative closed model and Phi 3.5 and Llama 3.2 Vision as representative open-source models.

Experiment Result: Figstep Dataset

To analyze the experimental results, a difference was observed between open-source and closed models. There was a significant difference in ASR between GPT-4o (40%), Phi 3.5 (98%), and Llama 3.2 Vision (80%).

Examining the open-source models, the Figstep Image format showed a much higher ASR than the Figstep Instruction format. Based on these results, it can be concluded that in open-source models without input/output filters, providing instructions as images rather than text made them more vulnerable.

Figstep Result: Textualized Input with higher-quality result

For GPT-4o, ASR was low for both text and image formats, but text input (52%) achieved a higher ASR than image input (40%). This discrepancy could be due to two main reasons:

The data quality may be insufficient, leading to the harmful task not being classified as harmful by the model
Similar to role-playing jailbreak attacks, the base text prompt along with the text instruction facilitated response generation

Figstep-Pro Dataset

We also applied another dataset mentioned in the Figstep paper. In that study, harmful task images were split into multiple pieces to jailbreak GPT-4V, resulting in an increase in ASR from 34% to 70% for Figstep. This dataset is referred to as Figstep-Pro.

The experiment showed that GPT-4o generated many responses, achieving an ASR of 82%. However, when considering response quality (Toxicity, Convincing), the effective ASR was assessed at 56%.

Figstep-Pro Result — Figstep-Pro Dataset experimental results on GPT-4o

Experiment 2: RTVLM

RTVLM Dataset Example — Example from RTVLM (Red Teaming Visual Language Models) dataset

We also examined the RTVLM dataset from the Red Teaming Visual Language Models paper. The RTVLM dataset is designed to extensively analyze aspects like performance, hallucination, and safety in VLMs.

Experiment Result: RTVLM Dataset

When applying the 22 Jailbreak Prompts from RTVLM to the three models, we observed an almost complete failure rate (GPT-4o: 1/22, Phi 3.5: 0%, Llama 3.2 Vision: 0%). Although the tasks requested by each Jailbreak Prompt were similar to those in Figstep, the models mostly understood the purpose and declined to perform them.

RTVLM Prompt Results — RTVLM Jailbreak Prompt experimental results across models

When applying the format in RTVLM that includes both image and text within an image, we observed a lower ASR compared to Figstep. Using 25 data points for each category of Illegal Activity and Hateful Speech, the results were: GPT-4o: 6%, Phi 3.5: 2%, and Llama 3.2 Vision: 24%.

RTVLM Image Results — RTVLM Image format experimental results

Conclusions

Based on the analysis of Jailbreak attempts across various VLM models using the Figstep and RTVLM datasets, it was evident that the success rate of VLM attacks is significantly influenced by:

The directness of the prompts
The combination of image and text
The length of the input text

Overly explicit or direct prompts and images tended to be rejected by most models, as they detected the risks involved. Closed models like GPT-4o were more resilient to such attacks due to additional input/output filtering, while open-source models exhibited higher vulnerabilities with certain input formats.

These findings suggest that adjusting the representation of text and images is a crucial factor in typographic and OCR-based attacks on VLMs. Ultimately, experimental research that explores diverse input methods and attack techniques will be necessary to effectively assess and enhance VLM safety.