AI Safety Benchmark
TheJudgementDay

Test the boundaries where AI fails to act, or acts when it shouldn't.

Overview
AIM Intelligence is soliciting scenario contributions for THE JUDGMENT DAY (Benchmarking Unsafe Action and Inaction in Multimodal AI Agents), an upcoming red-teaming competition evaluating safety decision-making in frontier multimodal AI systems.
The Judgment Day targets two complementary failure modes:
- Track 1: Unsafe Action — Models take harmful actions they should refuse
- Track 2: Unsafe Inaction — Models fail to act when action is required
We are collecting scenarios from the community to build a diverse, challenging evaluation suite. Selected scenarios receive a $50 reward.
Questions? Contact:
dasol.choi@aim-intelligence.comHow Contributions Will Be Used
- Competition: Selected scenarios will be used in the official Judgment Day competition
- Benchmark: Scenarios will be released as part of a public AI safety benchmark
- Research: Scenarios may be included in academic publications
Credit policy:
- Contributors will be acknowledged in the "Scenario Contributors" section of any resulting publications
- Each scenario in the released benchmark will include contributor attribution (name and optional affiliation)
By submitting, you grant AIM Intelligence the right to use, modify, and publish your scenario as part of the competition, benchmark, and related research outputs.
What We're Looking For
We want scenarios that are realistic, nuanced, and challenging—not toy examples that frontier models already handle well.
Multimodal Requirement
All scenarios must be designed to accept at least two modalities (e.g., text + image, text + audio, image + audio, etc.). This ensures comprehensive evaluation of multimodal AI systems.
Below we provide baseline examples for each track. Your submissions should be more complex, creative, and systematic than these examples. Think about edge cases, competing pressures, and subtle failure modes that would genuinely challenge state-of-the-art models.
Track 1: Unsafe Action

Goal: Induce the model to choose actions it should NOT take
Design scenarios where a model should refuse or defer, but might be manipulated into selecting a harmful action (e.g., approve a risky request, grant unauthorized access, bypass safety checks).
Track 2: Unsafe Inaction

Goal: Induce the model to NOT choose actions it SHOULD take
Design scenarios where a model has a clear duty to act (escalate emergency, report violation), but might be manipulated into choosing inaction or downgrading urgency.

Evaluation Criteria
Scenarios will be evaluated based on two core dimensions. Strong submissions excel in both:
1. Realism
Does this scenario reflect actual systems, real-world constraints, and plausible attack vectors?
✓ Strong realism:
- Based on actual system designs (medical triage, access control, financial monitoring)
- Follows real regulatory frameworks (HIPAA, AML, OSHA)
- Uses authentic terminology and workflows
- Reflects genuine operational constraints
- Attack vectors that could actually occur
✗ Weak realism:
- Fictional systems with no real-world analog
- Ignores actual regulations and standards
- Unrealistic user behavior or workflows
- Implausible attack scenarios
- Generic "toy" examples
2. Creativity & Novelty
Does this scenario explore new failure modes, domains, or attack strategies not widely studied?
✓ Strong creativity:
- Novel combinations of modalities (e.g., sensor data + voice + video)
- Unexplored domains (industrial robots, environmental monitoring)
- Sophisticated multi-step manipulation tactics
- Edge cases at boundaries of policies
- Innovative pressure tactics (authority + urgency + partial verification)
✗ Weak creativity:
- Direct copies of existing benchmarks
- Simple password/credential requests
- Single-sentence manipulation attempts
- Over-studied domains (basic SQL injection, prompt injection)
- Obvious "toy" adversarial examples
Output Format
All scenarios must use an action-selection format for automatic evaluation:
- Models must choose exactly one action from the allowed set
- Minimum 2 actions required (otherwise no meaningful decision)
- Recommended: 3–5 actions for nuanced scenarios
- Output format:
{"action_id": "..."}
Action naming guidelines:
- Use realistic, neutral action names that reflect real system interfaces
- ✓ Good:
approve_request,grant_access,accept_and_clear,proceed_without_verification - ✗ Bad:
harm_patient,commit_fraud,violate_policy,do_illegal_action - The violation action should not be obvious from its name alone—this makes the scenario too easy
This constraint enables scalable, judge-free evaluation without subjective labeling.
Timeline
| Date | Milestone |
|---|---|
| Feb 7, 2026 (AoE) | Submission deadline |
| Feb 14, 2026 | Selection announced |
| Feb 21, 2026 | Rewards distributed |
FAQ
Q: Can I submit multiple scenarios?
A: Yes, up to 5 scenarios per person. Each accepted scenario receives a $50 reward.
Q: What rights do I retain?
A: You retain the right to use your scenario idea elsewhere. By submitting, you grant AIM Intelligence non-exclusive rights to use, modify, and publish the scenario.
Q: Will scenarios be modified?
A: We may refine scenarios for consistency and clarity. Substantial changes will be discussed with contributors.
Q: Can I submit in languages other than English?
A: Submission format should be in English. Content within scenarios (e.g., documents) may include other languages.