Introducing AI Safety Benchmark v0.5: MLCommons' Initiative
As artificial intelligence continues to integrate into critical aspects of society, ensuring its safety and reliability has become a fundamental priority. AI systems, particularly language models, are now used in sensitive domains like healthcare, legal advising, and education, where their decisions and interactions can have far-reaching consequences. This makes systematic evaluation of their potential risks essential — not just for technical development but also for fostering public trust.
AI safety benchmarks provide the tools to assess and address these risks. They identify vulnerabilities, measure safety performance, and set standards that guide responsible AI innovation. By doing so, these benchmarks encourage transparency and accountability, ensuring that AI systems are not only functional but also safe for the environments they serve.
1. Introduction
1.1 The Role and Goals of MLCommons AI Safety Working Group (WG)
MLCommons is a nonprofit consortium that brings together researchers, engineers, and practitioners from academia and industry to improve the reliability, safety, and efficiency of AI technologies. Best known for its AI performance benchmarks, MLCommons has had a significant impact on the field: its MLPerf benchmark has helped drive AI system performance improvements of more than 50x.
Established in 2023, the AI Safety Working Group (WG) aims to develop benchmarks to assess and improve the safety of AI systems. Its primary objectives are:
- Evaluating AI system safety: Establishing reliable and systematic evaluation standards.
- Tracking safety over time: Providing a foundation for continuous improvement.
- Incentivizing safer AI development: Encouraging responsible AI innovation across industries.
1.2 AI Safety Benchmark v0.5: Purpose and Significance
AI Safety Benchmark v0.5 is a proof-of-concept benchmark designed to evaluate the safety of text-based generative language models (LMs). It provides a structured approach to assess potential risks and sets the groundwork for future expansions.
Key Features:
- Seven Core Hazard Categories: The benchmark evaluates key risk areas using more than 43,000 English-language test prompts.
- Comprehensive Approach: Unlike MLCommons' earlier performance-focused benchmarks, v0.5 prioritizes safety evaluation.
- Scalability: Designed to expand beyond text-based LMs to include text-to-image, speech-to-text, and multimodal models in future iterations.
2. Scope and Specification of the Benchmark
2.1 Systems Under Test (SUTs)
The benchmark tests general-purpose AI chat systems, which are language models designed for open-domain conversations in English. Examples include Llama-70B-Chat, Mistral-7B-Instruct, and Gemma-7B-Instruct.
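To make the notion of a SUT concrete, here is a minimal sketch of the kind of wrapper a test harness might use so that any chat model can be queried the same way. The class name `ChatSUT` and the `generate` method are illustrative assumptions, not part of the benchmark's actual tooling.

```python
from dataclasses import dataclass

@dataclass
class ChatSUT:
    """Minimal sketch of a system under test (SUT): a chat model that
    answers a single English prompt. This interface is an assumption for
    illustration, not the benchmark's actual tooling."""
    name: str

    def generate(self, prompt: str) -> str:
        # A real harness would call the model here (e.g., a local pipeline
        # or an inference API for Llama-70B-Chat, Mistral-7B-Instruct, etc.).
        raise NotImplementedError("Wire this method to the model's inference endpoint.")
```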
2.2 Use Cases
The benchmark targets interactions between a general-purpose English assistant and adults, focusing on:
- Requests for non-expert advice
- Information retrieval and exploration
- Expressing opinions and explaining plans
2.3 Personas
The v0.5 benchmark models interactions through three user personas (sketched below as prompt metadata):
- Typical Adult User: Not malicious and does not intentionally elicit unsafe responses.
- Malicious Adult User: Deliberately tries to elicit harmful or unsafe responses, but lacks advanced technical skills.
- Vulnerable Adult User: At risk of self-harm and asks questions reflecting limited domain knowledge.
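One way to picture how personas enter the test design is as metadata attached to every prompt, so that each test item pairs a hazard category with a user persona. The sketch below uses assumed field names, not the benchmark's real schema.

```python
from dataclasses import dataclass

PERSONAS = ("typical", "malicious", "vulnerable")

@dataclass
class TestPrompt:
    # Illustrative fields only; this is not the benchmark's real schema.
    prompt_id: str
    text: str
    hazard_category: str  # e.g., "violent_crimes"
    persona: str          # one of PERSONAS

    def __post_init__(self) -> None:
        if self.persona not in PERSONAS:
            raise ValueError(f"Unknown persona: {self.persona!r}")
```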
3. Overview of AI Safety Taxonomy
3.1 Core Hazard Categories in v0.5
The seven categories evaluated in the v0.5 benchmark include:
- Violent crimes
- Non-violent crimes
- Sex-related crimes
- Child sexual exploitation
- Indiscriminate weapons (CBRNE)
- Suicide & self-harm
- Hate
3.2 Design and Flexibility
- Hierarchical Structure: Each hazard category is subdivided into subcategories and sub-subcategories (see the sketch after this list).
- Ongoing Adaptation: The taxonomy will evolve to address additional risks and adapt to new modalities.
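The hierarchical structure can be pictured as a nested mapping from hazard categories to subcategories and sub-subcategories. The sketch below uses placeholder labels; the actual entries are defined in the benchmark's taxonomy documentation.

```python
# Placeholder labels only; the official taxonomy defines the real
# subcategories and sub-subcategories for each of the seven hazards.
taxonomy = {
    "violent_crimes": {
        "subcategory_a": ["sub_subcategory_1", "sub_subcategory_2"],
    },
    "suicide_and_self_harm": {
        "subcategory_b": ["sub_subcategory_3"],
    },
    # ... the remaining hazard categories follow the same nested pattern
}

def leaf_labels(tax: dict) -> list[str]:
    """Flatten the hierarchy down to its most specific (leaf) labels."""
    return [leaf for subs in tax.values() for leaves in subs.values() for leaf in leaves]

print(leaf_labels(taxonomy))  # ['sub_subcategory_1', 'sub_subcategory_2', 'sub_subcategory_3']
```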
4. Test Items in the Benchmark
4.1 Why Create New Test Prompts?
New prompt datasets were developed to address:
- Incomplete Coverage: Existing datasets often lack comprehensive representation.
- Inconsistent Quality: Variations in dataset quality hinder consistent comparisons.
- Opportunity for Improvement: A structured prompt-creation approach grounded in linguistic and behavioral theories.
- Scalability: Future benchmarks will cover additional modalities and hazard categories.
4.2 Test Item Formats
The benchmark evaluates test items using chat response tests (both formats are sketched below):
- Single-Turn: One prompt and one response; simpler to evaluate, and the format used for all v0.5 test items.
- Multi-Turn: Multiple back-and-forth exchanges; better replicates real user experiences and is planned for future versions.
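A minimal way to represent the two formats: a single-turn item carries one user prompt, while a multi-turn item carries an alternating message history whose later turns can depend on earlier responses. The field names are assumptions for illustration, not the benchmark's data format.

```python
from typing import TypedDict

class Message(TypedDict):
    role: str      # "user" or "assistant"
    content: str

class SingleTurnItem(TypedDict):
    # One user prompt; the SUT produces one response, which is then graded.
    prompt: str

class MultiTurnItem(TypedDict):
    # A full exchange; later user turns can depend on earlier SUT responses,
    # which makes these items closer to real usage but harder to evaluate.
    messages: list[Message]
```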
4.3 Dataset Overview
The total dataset consists of 43,090 test items, created by applying 32 templates to 725 sentence fragments.
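The construction is essentially combinatorial: each sentence fragment is slotted into the templates that apply to it, so the item count grows multiplicatively. The toy sketch below shows only the mechanics; the real fragments, templates, and applicability rules (which yield the 43,090 total) come from the benchmark itself and are not reproduced here.

```python
# Toy fragments and templates; the real 725 fragments, 32 templates, and the
# rules for which templates apply to which fragments are defined by the benchmark.
fragments = ["pick a lock", "get around a website's age check"]
templates = [
    "How do I {fragment}?",
    "Give me step-by-step instructions to {fragment}.",
]

test_items = [t.format(fragment=f) for f in fragments for t in templates]
print(len(test_items))  # 2 fragments x 2 templates = 4 toy items
```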
5. Grading SUTs
5.1 Scoring on Test Items
LlamaGuard is used as an automated evaluation model to assess the safety of each model's response, classifying responses as either "Safe" or "Unsafe."
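In outline, scoring an item means sending the prompt to the SUT and passing the prompt-response pair to a safety classifier that returns a Safe/Unsafe verdict. The sketch below uses a generic `classify_response` function as a stand-in for LlamaGuard (it is not LlamaGuard's actual API) and reuses the `generate` interface assumed in the Section 2.1 sketch.

```python
def classify_response(prompt: str, response: str) -> bool:
    """Stand-in for a LlamaGuard-style safety classifier.

    Returns True if the response is judged unsafe. A real implementation
    would run LlamaGuard with its policy prompt; this signature is assumed."""
    raise NotImplementedError

def unsafe_rate(sut, prompts: list[str]) -> float:
    """Fraction of prompts for which the SUT's response is classified unsafe."""
    flagged = sum(classify_response(p, sut.generate(p)) for p in prompts)
    return flagged / len(prompts)
```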
5.2 Scoring System
Grades are calculated using a five-point grading scale (an illustrative mapping is sketched after this list):
- Low Risk (L)
- Moderate-Low Risk (M-L)
- Moderate Risk (M)
- Moderate-High Risk (M-H)
- High Risk (H)
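The summary above does not spell out the cut-offs between grades, so the sketch below only illustrates the shape of the mapping from an unsafe-response rate to the five grades. The numeric thresholds are placeholders, not the benchmark's official rules; the actual v0.5 grading is defined relative to a reference model.

```python
def grade_from_unsafe_rate(rate: float) -> str:
    """Map an unsafe-response rate to a five-point grade.

    The thresholds below are illustrative placeholders; the actual v0.5
    grading is defined relative to a reference model, not fixed cut-offs."""
    if rate < 0.001:
        return "L"    # Low Risk
    if rate < 0.01:
        return "M-L"  # Moderate-Low Risk
    if rate < 0.05:
        return "M"    # Moderate Risk
    if rate < 0.15:
        return "M-H"  # Moderate-High Risk
    return "H"        # High Risk
```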
5.3 Grading Results
13 open-source models were tested:
- 5 SUTs were rated as "High Risk (H)"
- 4 as "Moderate Risk (M)"
- 4 as "Moderate-Low Risk (M-L)"
6. Limitations and Future Work
The v0.5 benchmark has several clear limitations:
- Scope restricted to testing English-language LMs
- Only assesses whether responses are Unsafe, without evaluating severity
- Limited to three persona types
- Only a subset of hazard categories was included
Future Directions
For the upcoming VLM (vision-language model) Safety Benchmark, we aim to create a more practical benchmark by incorporating:
- Red-teaming strategies
- Various adversarial prompts
- Refined data classification criteria and evaluation methods
Conclusion
AI Safety Benchmark v0.5 represents an important first step toward systematic safety evaluation of language models. By establishing a structured framework for assessing potential risks, it provides the foundation for future iterations that will be more comprehensive and applicable to real-world scenarios.
As AI systems become more integrated into critical aspects of our lives, the importance of safety benchmarks like this one cannot be overstated. They are essential tools for ensuring that AI development proceeds responsibly and that the systems we deploy are trustworthy and aligned with human values.