STREAM (ChemBio)
A Standard for Transparently Reporting Evaluations in AI Model Reports
Introduction
What is STREAM (ChemBio)? STREAM is a clear, standardized reporting framework that details the key information we view as necessary for third parties to understand and interpret the results of dangerous capability evaluations. Note that STREAM addresses the quality of reporting, not the quality of the underlying evaluations or any resulting risk interpretations.
It is designed to be both a practical resource and an assessment tool: companies and evaluators can use this standard to structure their model reports, and third parties can refer to the standard when assessing such reports. Because the science of evaluation is still developing, we intend for this to be an evolving standard which we update and adapt over time as best practices emerge. We thus refer to the standard in this paper as “version 1”. We invite researchers, practitioners, and regulators to use and improve upon STREAM.
Why is STREAM (ChemBio) needed? Powerful AI systems could provide great benefits to society, but may also bring large-scale risks (Bengio et al., 2025; OECD, 2024), such as misuse of these systems by malicious actors (Mouton et al., 2024; Brundage et al., 2018). In response, many leading AI companies have committed to regularly testing their systems for dangerous capabilities, including capabilities related to chemical and biological misuse (METR, 2025; FMF, 2024b). These tests are often referred to as “dangerous capability evaluations”, and they are a key component of AI companies’ Frontier AI Safety Policies (FSPs), voluntary commitments to the US White House (White House, 2023), and the EU General-Purpose AI Code of Practice (European Commission, 2025).
Despite their importance, there are currently no widely used standards for documenting dangerous capability evaluations clearly alongside model deployments (Weidinger et al., 2025; Paskov, 2025b). Several leading AI companies and AI Safety/Security Institutes regularly publish dangerous capability evaluation results in “model reports” (also called “model cards”, “system cards” or “safety cards”), and cite those results to support important claims about a model’s level of risk (Mitchell et al., 2019). But there is little consistency across such model reports on the evaluation details they provide.
In particular, many model reports lack sufficient information about how their evaluations were conducted, the outputs of the evaluations, and how the results informed judgments of a model’s potentially dangerous capabilities (Reuel et al., 2024a; Righetti, 2024; Bowen et al., 2025). This limits how informative and credible any resulting claims can be to readers, and impedes third-party attempts to replicate such results.
What is NOT covered by STREAM (ChemBio)? We focus specifically on benchmark evaluations related to chemical and biological (ChemBio) capabilities. Benchmark evaluations are common in model reports and are methodologically distinct from other types of evaluation (e.g. human uplift studies, red-teaming), while still sharing many considerations with them. Misuse of chemical and biological agents is well-studied in national security and law enforcement contexts (NASEM, 2018; NRC, 2008), and there appears to be more consensus on frontier AI capability thresholds for this topic than for many others (FMF, 2025b). However, many considerations in this reporting standard also apply to AI evaluation in other domains, such as cybersecurity or AI self-improvement, and it would be relatively straightforward to extend the standard to cover them.
Additionally, we intend for this version of STREAM to be applied to individual evaluations that are reported in a single document (e.g. a model report). For a model report to comply with a criterion, the information required by that criterion must be explicitly included in that document; information reported elsewhere but not reproduced in the model report does not count toward compliance. For example, even if the authors of a benchmark include human baselines in the original benchmark paper, a model report that omits these when reporting on that benchmark will not comply with the “Baseline Performance” criteria (5.4). This is because model reports should provide third parties with one clear compilation of the evidence, giving readers a consistent picture and ensuring that important information does not “slip through the cracks”.
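To make the assessment mechanics concrete, the sketch below shows one way a third-party reviewer might tally per-criterion scores under the 0.5/1-point scheme used in the tables that follow. The criterion labels, dictionary layout, and helper function are illustrative assumptions, not part of the standard itself.

```python
# Minimal sketch of tallying per-criterion STREAM scores for a single evaluation.
# The 0 / 0.5 / 1 point values follow the "At a minimum" / "Full credit" scheme
# in the tables below; everything else here is illustrative only.

SCORE_VALUES = {"not met": 0.0, "minimum": 0.5, "full": 1.0}

def tally(scores):
    """Return (points awarded, points possible) for one evaluation's criteria."""
    awarded = sum(SCORE_VALUES[level] for level in scores.values())
    return awarded, float(len(scores))

# Hypothetical grading of one evaluation against the Threat Relevance criteria.
example = {
    "Threat relevance (i)": "full",
    "Threat relevance (ii)": "minimum",
    "Threat relevance (iii)": "not met",
}
print(tally(example))  # (1.5, 3.0)
```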
STREAM (ChemBio)
Threat Relevance
| Criterion | At a minimum (0.5 pt) | Full credit (1 pt) |
|---|---|---|
| (i) Capability being measured and relevance to threat model | | |
| (ii) Strength & nature of evidence provided by evaluation and performance threshold | | |
| (iii) Example item & answer | | |
* This can be stated just once in the model card for a suite of evaluations. Note that, very frequently, a model card will simply identify the core threat model(s) once in an introductory section covering the CBRN evaluations as a whole; this is fine and should still be awarded points.
§ This is not necessary if evaluators explicitly disclose in the report that the evaluation is not a significant contributor to the model safety assessment. In that case, the “minimum” is sufficient for full credit. Additionally, these full-credit requirements can automatically be counted as met if, instead of labelling individual evals as rule-in or rule-out, both of the following conditions are met:
- the evaluation report elsewhere explains that no single evaluation is capable of ruling a capability in or out; and
- the evaluation report explains at some point the overall configuration of evidence that would either “rule in” or “rule out” capabilities, as required by 5.6(i) and 5.6(ii).
Test Construction
| Criterion | At a minimum (0.5 pt) | Full credit (1 pt) |
|---|---|---|
| (i) Number of items tested & relation to full test set | | |
| (ii) Answer format & scoring | | |
| (iii) Answer-key / rubric creation and quality assurance | | |
| (iv-a) Human-graded: grader sample and recruiting details | | |
| (iv-b) Human-graded: details of grading process and grade aggregation | | |
| (iv-c) Human-graded: level of grader agreement | | |
| (v-a) Auto-graded: grader model used and modifications applied to it | | |
| (v-b) Auto-graded: details of auto-grading process and rubric used | | |
| (v-c) Auto-graded: validation of auto-grader | | |
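Criteria (iv-c) and (v-c) ask reporters to quantify how well human graders agree with one another, or how closely an auto-grader matches human judgments. As a rough illustration of the kind of statistic that could support such a claim, the sketch below computes percent agreement and Cohen’s kappa over hypothetical pass/fail grades; the data and function names are invented for illustration, and STREAM does not prescribe any particular agreement metric.

```python
# Illustrative agreement statistics between two sets of grades on the same items
# (e.g. auto-grader vs. human grades). Hypothetical data only.
from collections import Counter

def percent_agreement(grades_a, grades_b):
    matches = sum(a == b for a, b in zip(grades_a, grades_b))
    return matches / len(grades_a)

def cohens_kappa(grades_a, grades_b):
    """Chance-corrected agreement; assumes expected agreement < 1."""
    n = len(grades_a)
    observed = percent_agreement(grades_a, grades_b)
    counts_a, counts_b = Counter(grades_a), Counter(grades_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

human = ["pass", "fail", "pass", "pass", "fail", "pass"]
auto  = ["pass", "fail", "pass", "fail", "fail", "pass"]
print(percent_agreement(human, auto), cohens_kappa(human, auto))
```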
Model Elicitation
| Criterion | At a minimum (0.5 pt) | Full credit (1 pt) |
|---|---|---|
| (i) Model version(s) tested and relationship to launch model | | |
| (ii) Safeguards & bypassing techniques | | |
| (iii) Elicitation methods used | | |
** The approach to testing safeguarded models (including, e.g., the active safeguards and the bypassing strategies) can be reported just once in the model card, so long as the reader can clearly distinguish which evaluations this applies to (if this is not made clear, the point is not awarded).
† These criteria can be met if the model card makes clear elsewhere that, unless otherwise specified, the models tested for each evaluation were of a particular type, and specifies the relation of that type to the deployed model. Be sure to search the entire model card comprehensively to see whether such statements exist and whether they apply to the particular evaluation being scored.
Results Reporting
| Criterion | At a minimum (0.5 pt) | Full credit (1 pt) |
|---|---|---|
| (i) Main scores | | |
| (ii) Uncertainty & number of benchmark runs | | |
| (iii) Ablations / alt. conditions | | |
*** This can also be met if the model card makes clear elsewhere that each evaluation is run a given number of times or, alternatively, specifies a different method for establishing confidence intervals (such as a “bootstrap procedure”).
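As an illustration of the bootstrap procedure footnote *** alludes to, the sketch below resamples hypothetical per-item scores to put an approximate 95% confidence interval around a benchmark mean. The scores, resample count, and function name are assumptions for the example; any statistically sound method of reporting uncertainty would satisfy the criterion.

```python
# Minimal bootstrap confidence interval for a benchmark mean (hypothetical data).
import random

def bootstrap_ci(item_scores, n_resamples=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(item_scores)
    # Resample items with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(item_scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# 1 = correct, 0 = incorrect on each benchmark item (hypothetical results).
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
print(bootstrap_ci(scores))  # approx. 95% CI around the mean score of 0.70
```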
Baseline Results
| Criterion | At a minimum (0.5 pt) | Full credit (1 pt) |
|---|---|---|
| (i-a) Human baseline: sample details | | |
| (i-b) Human baseline: full scores and uncertainty metrics | | |
| (i-c) Human baseline: details of elicitation, resources available and incentives provided | | |
| (ii-a) No human baseline: justification of absence | | |
| (ii-b) No human baseline: alternative to human baseline | | |
Safety Interpretation
| Criterion | At a minimum (0.5 pt) | Full credit (1 pt) |
|---|---|---|
| (i) Interpretation of test outcomes, capability conclusion and relation to decision-making | | |
| (ii) Details of test results that would have overturned the above capability conclusion and details of preregistration | | |
| (iii) Predictions of future performance gains from post-training and elicitation | | |
| (iv) Time available to relevant teams for interpretation of results | | |
| (v) Presence of internal disagreements | | |
†† Criteria in this section can be reported once per report, rather than once per evaluation.
**** Note that one way to meet this criterion is to have clearly marked every CBRN evaluation as a “rule-in” or “rule-out” evaluation with a clear and comprehensible performance threshold. If this is done, it is clear which results would have overturned the capability conclusion: rule-in or rule-out evaluations falling on the other side of that performance threshold. In that case, the model card should ideally also explain what the developer would do in the rare case that a rule-in and a rule-out threshold explicitly contradicted one another.
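The rule-in / rule-out logic described in footnote **** can be made concrete with a short sketch: each evaluation carries a pre-registered threshold, and the side of that threshold the observed score falls on determines whether the evaluation “fires”. The evaluation names, thresholds, and scores below are hypothetical; STREAM only asks that such designations and thresholds be reported, not that they take this form.

```python
# Illustrative rule-in / rule-out threshold check (hypothetical values only).
from dataclasses import dataclass

@dataclass
class Evaluation:
    name: str
    kind: str          # "rule-in" or "rule-out"
    threshold: float   # pre-registered performance threshold
    score: float       # observed model score

    def fired(self) -> bool:
        # A rule-in evaluation fires when the score reaches its threshold;
        # a rule-out evaluation fires when the score stays below its threshold.
        if self.kind == "rule-in":
            return self.score >= self.threshold
        return self.score < self.threshold

evals = [
    Evaluation("benchmark A", "rule-in", threshold=0.80, score=0.62),
    Evaluation("benchmark B", "rule-out", threshold=0.30, score=0.41),
]

ruled_in = any(e.fired() for e in evals if e.kind == "rule-in")
ruled_out = any(e.fired() for e in evals if e.kind == "rule-out")
print("capability ruled in:", ruled_in, "| capability ruled out:", ruled_out)
if ruled_in and ruled_out:
    print("Rule-in and rule-out thresholds contradict; the report should say how this is resolved.")
```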
Endorsements
Greater transparency into AI risks is essential for responsible innovation. This requires that model evaluations are reported clearly, providing the public with enough information to make judgments. We're glad that the STREAM standard exists and hope to see future model cards adopt it.
— UK AISI
Greater transparency into AI risks is essential for responsible innovation. This requires that model evaluations are reported clearly, providing the public with enough information to make judgments. We're glad that the STREAM standard exists and hope to see future model cards adopt it.
— FutureHouse
Greater transparency into AI risks is essential for responsible innovation. This requires that model evaluations are reported clearly, providing the public with enough information to make judgments. We're glad that the STREAM standard exists and hope to see future model cards adopt it.
— US CAISI
Greater transparency into AI risks is essential for responsible innovation. This requires that model evaluations are reported clearly, providing the public with enough information to make judgments. We're glad that the STREAM standard exists and hope to see future model cards adopt it.
— Deloitte
Greater transparency into AI risks is essential for responsible innovation. This requires that model evaluations are reported clearly, providing the public with enough information to make judgments. We're glad that the STREAM standard exists and hope to see future model cards adopt it.
— RAND
Authors
| Author | Affiliation |
|---|---|
| Tegan McCaslin | Independent |
| Jide Alaga | METR |
| Samira Nedungadi | SecureBio |
| Seth Donoughe | SecureBio |
| Tom Reed | UK AISI |
| Chris Painter | METR |
| Rishi Bommasani | HAI |
| Luca Righetti | GovAI |
For email correspondence, contact Luca Righetti (luca.righetti@governance.ai)