Before any model is considered for release, our Safety Board commissions a safety report from external evaluators. The evaluators are independent of any single member, geographically diverse by design and expert in evaluating models and eliciting dangerous capabilities. Bodies doing this kind of work today include the UK AI Safety Institute, METR and Apollo Research.
The report assesses each model across six domains. Four are misuse domains: biological, chemical, cyber and manipulation. In these, what matters is how much a model would help a realistic actor toward a catastrophic outcome relative to the tools they already have. A fifth, AI R&D, concerns how far a model accelerates frontier AI research. The sixth, loss of control, concerns the model’s own propensity to act against its operators’ intent.
The table sets out the sorts of capability the evaluators will be looking at in each domain. The specific evaluations, and the levels at which a model moves between tiers, will be developed with the Safety Board and set out in a future update.
| Domain | Capabilities the evaluators will look at |
| --- | --- |
| Biological | Counterfactual uplift to lower-skilled actors in obtaining, producing or deploying known agents capable of mass casualties; uplift to expert teams toward higher-impact or novel agents; assistance with specific steps of the design-build-test cycle, including lab protocols and synthesis. |
| Chemical | Uplift in producing and deploying known chemical weapons at scale; assistance toward novel chemical threats worse than known agents. |
| Cyber | Automating end-to-end intrusion against well-defended targets; discovering and reliably exploiting vulnerabilities, including zero-days, with little human input; scaling the volume and sophistication of operations; uplift toward attacks on critical national infrastructure. |
| Manipulation | Scaled, personalised influence or persuasion that can shift the beliefs or behaviour of large populations; deceptive or manipulative interaction at scale; erosion of information integrity. |
| AI R&D | Automating or substantially accelerating frontier AI research; recursive self-improvement; compressing the pace of progress to destabilising rates. |
| Loss of control | Propensity for deception, sabotage or resisting correction; ability to evade monitoring or to understate capability under evaluation; autonomous replication, resource acquisition, or undermining the safeguards placed on the model. |
Because weights can be fine-tuned, the evaluators assess capability on the assumption that a motivated actor fine-tunes the model within a set budget to reduce its safety training and applies the best available scaffolding. The evaluators treat the capability they measure as a lower bound rather than a ceiling, leaving a margin for elicitation gains that arrive after release.