Refining Vision-Language Model Benchmarks: Base Query Generation and Toxicity Analysis

For the existing VLM Safety benchmark, the proposed image and text pair dataset, there are cases where the text alone is sufficiently informative without the image, and there are cases where the example is not harmful enough even if the jailbreak is successful.

Therefore, there should be a base query that encompasses both image and text.

The base query we're talking about here is the "original question" that contains the essential intent of the original text prompt. For example, if the image is a picture of a poppy and the text is "Give me a recipe using the plant in the picture," the base query could be "Give me a recipe for meth using poppies."

Base Query Generation

So how do we generate a base query that contains information from both images and text?

We applied Prism's method to extract text information from images.

Prism is an innovative framework designed to assess the perception and reasoning processes of Vision Language Models (VLMs). Prism operates in two main stages:

Perception Stage: Extracts visual information and converts it into descriptive text.
Reasoning Stage: Utilizes the extracted visual information to generate responses through a Large Language Model (LLM).

Prism Framework Diagram — Prism Framework for VLM Assessment

Applying Prism's method, we asked GPT-4o to generate a base query by incorporating the description of the generated image and a text prompt.

How to Measure the Toxicity of Base Query

The problem with the existing VLM safety benchmark is that some cases are judged to be harmful when the data is not harmful enough, so we wanted to determine if the base query is harmful and its toxicity.

We utilized two methods to measure the toxicity of the base query: Detoxify and moderations endpoint.

1. Detoxify

Detoxify is an open-source toxicity classifier, to help researchers and practitioners identify potential toxic comments.

Detoxify models are trained to predict toxic comments on 3 Jigsaw challenges:

Toxic comment classification
Unintended Bias in Toxic comments
Multilingual toxic comment classification

2. Moderations Endpoint

The moderations endpoint is a tool to determine if a text or image served by OpenAI is potentially harmful. It is used to categorize both text and images.

Following categories describes the types of content that can be detected in the moderation API:

harassment, hate, illicit, self-harm, sexual, violence

Results of Measuring the Toxicity of a Base Query

Let's take a look at the results of measuring the toxicity of base queries using these two tools. We ran experiments on two of MLLMGuard's five categories, Bias and Legal.

Bias

There are 172 data in the Bias category after preprocessing.

First, after applying Detoxify, the toxicity scores were all in the 0 range except for two data points.

Bias Toxicity Scores — Bias Category Toxicity Scores

In some cases, we found that the base query was not properly generated in response to the text prompt.

Bias Example Table — Bias Example - Image Description, Text Prompt, and Base Query

Bias Example Image — Bias Example - Original Image

The result of applying the moderation API:

Bias Moderation API Results — Bias Category - Moderation API Results

Legal

The Legal category consists of a total of 92 data points.

First, we applied Detoxify and found that all of the toxicity scores were in the 0 range except for two data points, similar to the bias category.

Legal Toxicity Scores — Legal Category Toxicity Scores

The first data was a case where the base query matched an existing text prompt. The second data was a case where the base query didn't quite capture the intent.

Legal Example Table — Legal Example - Image Description, Text Prompt, and Base Query

Legal Example Image — Legal Example - Original Image

After applying the moderations endpoint, there were no toxic cases detected.

Future Work

To summarize the above experimental results, when generating base queries using GPT-4o, there are cases where it is not possible to generate them well. Also, Detoxify or moderations endpoint is a tool that judges toxicity through the presence of profanity, swearing, etc. in the sentence, but it is not enough to judge the toxicity of jailbreak-related responses.

Therefore, we plan to measure the toxicity of the base query using MLLM, and we will also conduct an experiment to compare the base query of the famous LLM dataset and the base query of the VLM dataset.