Introducing AI Safety Benchmark v0.5: MLCommons' Initiative
As artificial intelligence continues to integrate into critical aspects of society, ensuring its safety and reliability has become a fundamental priority. AI systems, particularly language models, are now used in sensitive domains like healthcare, legal advising, and education, where their decisions and interactions can have far-reaching consequences. This makes systematic evaluation of their potential risks essential — not just for technical development but also for fostering public trust.
AI safety benchmarks provide the tools to assess and address these risks. They identify vulnerabilities, measure safety performance, and set standards that guide responsible AI innovation. By doing so, these benchmarks encourage transparency and accountability, ensuring that AI systems are not only functional but also safe for the environments they serve.
1. Introduction
1.1 The Role and Goals of MLCommons AI Safety Working Group (WG)
MLCommons is a nonprofit consortium that brings together researchers, engineers, and practitioners from academia and industry to improve the reliability, safety, and efficiency of AI technologies. Best known for its AI performance benchmarks, MLCommons has had a significant impact on the field: its MLPerf benchmark has helped drive AI system performance improvements of more than 50x.
Established in 2023, the AI Safety Working Group (WG) aims to develop benchmarks to assess and improve the safety of AI systems. Its primary objectives are:
- Evaluating AI system safety: Establishing reliable and systematic evaluation standards.
- Tracking safety over time: Providing a foundation for continuous improvement.
- Incentivizing safer AI development: Encouraging responsible AI innovation across industries.
1.2 AI Safety Benchmark v0.5: Purpose and Significance
AI Safety Benchmark v0.5 is a proof-of-concept benchmark designed to evaluate the safety of text-based generative language models (LMs). It provides a structured approach to assess potential risks and sets the groundwork for future expansions.
Key Features:
- Seven Core Hazard Categories: The benchmark evaluates key risk areas using more than 43,000 English-language test prompts.
- Comprehensive Approach: Unlike MLCommons' earlier performance-focused benchmarks, v0.5 prioritizes safety evaluation.
- Scalability: Designed to expand beyond text-based LMs to include text-to-image, speech-to-text, and multimodal models in future iterations.
2. Scope and Specification of the Benchmark
2.1 Systems Under Test (SUTs)
The benchmark tests general-purpose AI chat systems, which are language models designed for open-domain conversations in English. Examples include Llama-70B-Chat, Mistral-7B-Instruct, and Gemma-7B-Instruct.
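To make the notion of a SUT concrete, here is a minimal sketch of the kind of wrapper a test harness might use so that any chat model can be queried the same way. The class name `ChatSUT` and the `generate` method are illustrative assumptions, not part of the benchmark's actual tooling.

```python
from dataclasses import dataclass

@dataclass
class ChatSUT:
    """Minimal sketch of a system under test (SUT): a chat model that
    answers a single English prompt. This interface is an assumption for
    illustration, not the benchmark's actual tooling."""
    name: str

    def generate(self, prompt: str) -> str:
        # A real harness would call the model here (e.g., a local pipeline
        # or an inference API for Llama-70B-Chat, Mistral-7B-Instruct, etc.).
        raise NotImplementedError("Wire this method to the model's inference endpoint.")
```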
2.2 Use Cases
The benchmark targets interactions between a general-purpose English assistant and adults, focusing on:
- Requests for non-expert advice
- Information retrieval and exploration
- Expressing opinions and explaining plans
2.3 Personas
The v0.5 benchmark models interactions through three user personas (sketched below as prompt metadata):
- Typical Adult User: Not malicious and does not intentionally elicit unsafe responses.
- Malicious Adult User: Deliberately tries to elicit harmful or unsafe responses, but lacks advanced technical skills.
- Vulnerable Adult User: At risk of self-harm and asks questions reflecting limited domain knowledge.
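One way to picture how personas enter the test design is as metadata attached to every prompt, so that each test item pairs a hazard category with a user persona. The sketch below uses assumed field names, not the benchmark's real schema.

```python
from dataclasses import dataclass

PERSONAS = ("typical", "malicious", "vulnerable")

@dataclass
class TestPrompt:
    # Illustrative fields only; this is not the benchmark's real schema.
    prompt_id: str
    text: str
    hazard_category: str  # e.g., "violent_crimes"
    persona: str          # one of PERSONAS

    def __post_init__(self) -> None:
        if self.persona not in PERSONAS:
            raise ValueError(f"Unknown persona: {self.persona!r}")
```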
3. Overview of AI Safety Taxonomy
3.1 Core Hazard Categories in v0.5
The seven categories evaluated in the v0.5 benchmark include:
- Violent crimes
- Non-violent crimes
- Sex-related crimes
- Child sexual exploitation
- Indiscriminate weapons (CBRNE)
- Suicide & self-harm
- Hate
3.2 Design and Flexibility
- Hierarchical Structure: Each hazard category is subdivided into subcategories and sub-subcategories (see the sketch after this list).
- Ongoing Adaptation: The taxonomy will evolve to address additional risks and adapt to new modalities.
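The hierarchical structure can be pictured as a nested mapping from hazard categories to subcategories and sub-subcategories. The sketch below uses placeholder labels; the actual entries are defined in the benchmark's taxonomy documentation.

```python
# Placeholder labels only; the official taxonomy defines the real
# subcategories and sub-subcategories for each of the seven hazards.
taxonomy = {
    "violent_crimes": {
        "subcategory_a": ["sub_subcategory_1", "sub_subcategory_2"],
    },
    "suicide_and_self_harm": {
        "subcategory_b": ["sub_subcategory_3"],
    },
    # ... the remaining hazard categories follow the same nested pattern
}

def leaf_labels(tax: dict) -> list[str]:
    """Flatten the hierarchy down to its most specific (leaf) labels."""
    return [leaf for subs in tax.values() for leaves in subs.values() for leaf in leaves]

print(leaf_labels(taxonomy))  # ['sub_subcategory_1', 'sub_subcategory_2', 'sub_subcategory_3']
```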
4. Test Items in the Benchmark
4.1 Why Create New Test Prompts?
New prompt datasets were developed to address:
- Incomplete Coverage: Existing datasets often lack comprehensive representation.
- Inconsistent Quality: Variations in dataset quality hinder consistent comparisons.
- Opportunity for Improvement: A structured prompt-creation approach grounded in linguistic and behavioral theories.
- Scalability: Future benchmarks will cover additional modalities and hazard categories.
4.2 Test Item Formats
The benchmark evaluates test items using chat response tests (both formats are sketched below):
- Single-Turn: One prompt and one response; simpler to evaluate, and the format used for all v0.5 test items.
- Multi-Turn: Multiple back-and-forth exchanges; better replicates real user experiences and is planned for future versions.
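A minimal way to represent the two formats: a single-turn item carries one user prompt, while a multi-turn item carries an alternating message history whose later turns can depend on earlier responses. The field names are assumptions for illustration, not the benchmark's data format.

```python
from typing import TypedDict

class Message(TypedDict):
    role: str      # "user" or "assistant"
    content: str

class SingleTurnItem(TypedDict):
    # One user prompt; the SUT produces one response, which is then graded.
    prompt: str

class MultiTurnItem(TypedDict):
    # A full exchange; later user turns can depend on earlier SUT responses,
    # which makes these items closer to real usage but harder to evaluate.
    messages: list[Message]
```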
4.3 Dataset Overview
The total dataset consists of 43,090 test items, created by applying 32 templates to 725 sentence fragments.
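The construction is essentially combinatorial: each sentence fragment is slotted into the templates that apply to it, so the item count grows multiplicatively. The toy sketch below shows only the mechanics; the real fragments, templates, and applicability rules (which yield the 43,090 total) come from the benchmark itself and are not reproduced here.

```python
# Toy fragments and templates; the real 725 fragments, 32 templates, and the
# rules for which templates apply to which fragments are defined by the benchmark.
fragments = ["pick a lock", "get around a website's age check"]
templates = [
    "How do I {fragment}?",
    "Give me step-by-step instructions to {fragment}.",
]

test_items = [t.format(fragment=f) for f in fragments for t in templates]
print(len(test_items))  # 2 fragments x 2 templates = 4 toy items
```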
5. Grading SUTs
5.1 Scoring on Test Items
LlamaGuard is used as an automated evaluation model to assess the safety of each model's response, classifying responses as either "Safe" or "Unsafe."
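In outline, scoring an item means sending the prompt to the SUT and passing the prompt-response pair to a safety classifier that returns a Safe/Unsafe verdict. The sketch below uses a generic `classify_response` function as a stand-in for LlamaGuard (it is not LlamaGuard's actual API) and reuses the `generate` interface assumed in the Section 2.1 sketch.

```python
def classify_response(prompt: str, response: str) -> bool:
    """Stand-in for a LlamaGuard-style safety classifier.

    Returns True if the response is judged unsafe. A real implementation
    would run LlamaGuard with its policy prompt; this signature is assumed."""
    raise NotImplementedError

def unsafe_rate(sut, prompts: list[str]) -> float:
    """Fraction of prompts for which the SUT's response is classified unsafe."""
    flagged = sum(classify_response(p, sut.generate(p)) for p in prompts)
    return flagged / len(prompts)
```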
5.2 Scoring System
Grades are calculated using a five-point grading scale (an illustrative mapping is sketched after this list):
- Low Risk (L)
- Moderate-Low Risk (M-L)
- Moderate Risk (M)
- Moderate-High Risk (M-H)
- High Risk (H)
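The summary above does not spell out the cut-offs between grades, so the sketch below only illustrates the shape of the mapping from an unsafe-response rate to the five grades. The numeric thresholds are placeholders, not the benchmark's official rules; the actual v0.5 grading is defined relative to a reference model.

```python
def grade_from_unsafe_rate(rate: float) -> str:
    """Map an unsafe-response rate to a five-point grade.

    The thresholds below are illustrative placeholders; the actual v0.5
    grading is defined relative to a reference model, not fixed cut-offs."""
    if rate < 0.001:
        return "L"    # Low Risk
    if rate < 0.01:
        return "M-L"  # Moderate-Low Risk
    if rate < 0.05:
        return "M"    # Moderate Risk
    if rate < 0.15:
        return "M-H"  # Moderate-High Risk
    return "H"        # High Risk
```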
5.3 Grading Results
13 open-source models were tested:
- 5 SUTs were rated as "High Risk (H)"
- 4 as "Moderate Risk (M)"
- 4 as "Moderate-Low Risk (M-L)"
6. Limitations and Future Work
The v0.5 benchmark has several clear limitations:
- Scope restricted to testing English-language LMs
- Only assesses whether responses are Unsafe, without evaluating severity
- Limited to three persona types
- Only a subset of hazard categories was included
Future Directions
For the upcoming VLM (vision-language model) Safety Benchmark, we aim to create a more practical benchmark by incorporating:
- Red-teaming strategies
- Various adversarial prompts
- Refined data classification criteria and evaluation methods
Conclusion
AI Safety Benchmark v0.5 represents an important first step toward systematic safety evaluation of language models. By establishing a structured framework for assessing potential risks, it provides the foundation for future iterations that will be more comprehensive and applicable to real-world scenarios.
As AI systems become more integrated into critical aspects of our lives, the importance of safety benchmarks like this one cannot be overstated. They are essential tools for ensuring that AI development proceeds responsibly and that the systems we deploy are trustworthy and aligned with human values.