Refining Vision-Language Model Benchmarks: Base Query Generation and Toxicity Analysis
For the existing VLM Safety benchmark, the proposed image and text pair dataset, there are cases where the text alone is sufficiently informative without the image, and there are cases where the example is not harmful enough even if the jailbreak is successful.
Therefore, there should be a base query that encompasses both image and text.
The base query we're talking about here is the "original question" that contains the essential intent of the original text prompt. For example, if the image is a picture of a poppy and the text is "Give me a recipe using the plant in the picture," the base query could be "Give me a recipe for meth using poppies."
Base Query Generation
So how do we generate a base query that contains information from both images and text?
We applied Prism's method to extract text information from images.
Prism is an innovative framework designed to assess the perception and reasoning processes of Vision Language Models (VLMs). Prism operates in two main stages:
- Perception Stage: Extracts visual information and converts it into descriptive text.
- Reasoning Stage: Utilizes the extracted visual information to generate responses through a Large Language Model (LLM).

Applying Prism's method, we asked GPT-4o to generate a base query by incorporating the description of the generated image and a text prompt.
How to Measure the Toxicity of Base Query
The problem with the existing VLM safety benchmark is that some cases are judged to be harmful when the data is not harmful enough, so we wanted to determine if the base query is harmful and its toxicity.
We utilized two methods to measure the toxicity of the base query: Detoxify and moderations endpoint.
1. Detoxify
Detoxify is an open-source toxicity classifier, to help researchers and practitioners identify potential toxic comments.
Detoxify models are trained to predict toxic comments on 3 Jigsaw challenges:
- Toxic comment classification
- Unintended Bias in Toxic comments
- Multilingual toxic comment classification

2. Moderations Endpoint
The moderations endpoint is a tool to determine if a text or image served by OpenAI is potentially harmful. It is used to categorize both text and images.
Following categories describes the types of content that can be detected in the moderation API:
- harassment, hate, illicit, self-harm, sexual, violence
Results of Measuring the Toxicity of a Base Query
Let's take a look at the results of measuring the toxicity of base queries using these two tools. We ran experiments on two of MLLMGuard's five categories, Bias and Legal.
Bias
There are 172 data in the Bias category after preprocessing.
First, after applying Detoxify, the toxicity scores were all in the 0 range except for two data points.

In some cases, we found that the base query was not properly generated in response to the text prompt.


The result of applying the moderation API:

Legal
The Legal category consists of a total of 92 data points.
First, we applied Detoxify and found that all of the toxicity scores were in the 0 range except for two data points, similar to the bias category.

The first data was a case where the base query matched an existing text prompt. The second data was a case where the base query didn't quite capture the intent.


After applying the moderations endpoint, there were no toxic cases detected.
Future Work
To summarize the above experimental results, when generating base queries using GPT-4o, there are cases where it is not possible to generate them well. Also, Detoxify or moderations endpoint is a tool that judges toxicity through the presence of profanity, swearing, etc. in the sentence, but it is not enough to judge the toxicity of jailbreak-related responses.
Therefore, we plan to measure the toxicity of the base query using MLLM, and we will also conduct an experiment to compare the base query of the famous LLM dataset and the base query of the VLM dataset.