Evaluating Text-based VLM Attack Methods: In-depth Look at Figstep
To evaluate VLM Safety, it is essential to develop a secure model that incorporates the unique characteristics of VLMs. When constructing a VLM Safety benchmark based on existing LLM benchmarks, it is crucial to take into account the differences between LLMs and VLMs.
The first key aspect to consider is the data input method. In LLMs, text is input sequentially as tokens, allowing the model to grasp context and infer meaning. In contrast, VLMs handle multimodal data by simultaneously receiving both image and text inputs. The image data is processed through a dedicated encoder, transforming it into pixel-level information or vectorized representations of visual features before being passed to the model.

From this perspective, an intriguing case is the Figstep dataset. This dataset is constructed in a very simple way, where text is overlaid on a white background to create images, without considering complex VLM factors such as black-box or gradient methods. Despite this simplicity, the results presented in the paper introducing Figstep indicate an average Attack Success Rate (ASR) of over 80% across typical VLMs.

Experiment 1: Figstep Dataset
First, let's analyze the Figstep dataset. Figstep's image input format involves a harmful instruction with numbers 1, 2, and 3 attached, requiring a list or stepwise explanation. This harmful instruction is a simplified version of harmful instructions found in existing LLM benchmarks like Safebench.
Next, we examined whether the latest VLMs can effectively defend against this type of jailbreak attack. We selected GPT-4o as the representative closed model and Phi 3.5 and Llama 3.2 Vision as representative open-source models.

Experiment Result: Figstep Dataset
To analyze the experimental results, a difference was observed between open-source and closed models. There was a significant difference in ASR between GPT-4o (40%), Phi 3.5 (98%), and Llama 3.2 Vision (80%).

Examining the open-source models, the Figstep Image format showed a much higher ASR than the Figstep Instruction format. Based on these results, it can be concluded that in open-source models without input/output filters, providing instructions as images rather than text made them more vulnerable.

For GPT-4o, ASR was low for both text and image formats, but text input (52%) achieved a higher ASR than image input (40%). This discrepancy could be due to two main reasons:
- The data quality may be insufficient, leading to the harmful task not being classified as harmful by the model
- Similar to role-playing jailbreak attacks, the base text prompt along with the text instruction facilitated response generation
Figstep-Pro Dataset
We also applied another dataset mentioned in the Figstep paper. In that study, harmful task images were split into multiple pieces to jailbreak GPT-4V, resulting in an increase in ASR from 34% to 70% for Figstep. This dataset is referred to as Figstep-Pro.
The experiment showed that GPT-4o generated many responses, achieving an ASR of 82%. However, when considering response quality (Toxicity, Convincing), the effective ASR was assessed at 56%.

Experiment 2: RTVLM

We also examined the RTVLM dataset from the Red Teaming Visual Language Models paper. The RTVLM dataset is designed to extensively analyze aspects like performance, hallucination, and safety in VLMs.
Experiment Result: RTVLM Dataset
When applying the 22 Jailbreak Prompts from RTVLM to the three models, we observed an almost complete failure rate (GPT-4o: 1/22, Phi 3.5: 0%, Llama 3.2 Vision: 0%). Although the tasks requested by each Jailbreak Prompt were similar to those in Figstep, the models mostly understood the purpose and declined to perform them.

When applying the format in RTVLM that includes both image and text within an image, we observed a lower ASR compared to Figstep. Using 25 data points for each category of Illegal Activity and Hateful Speech, the results were: GPT-4o: 6%, Phi 3.5: 2%, and Llama 3.2 Vision: 24%.

Conclusions
Based on the analysis of Jailbreak attempts across various VLM models using the Figstep and RTVLM datasets, it was evident that the success rate of VLM attacks is significantly influenced by:
- The directness of the prompts
- The combination of image and text
- The length of the input text
These findings suggest that adjusting the representation of text and images is a crucial factor in typographic and OCR-based attacks on VLMs. Ultimately, experimental research that explores diverse input methods and attack techniques will be necessary to effectively assess and enhance VLM safety.