2026-06-09·5 min read·sota.io Team

EU AI Act Regulatory Sandbox Testing Protocol: How to Generate Conformity Assessment Evidence (2026)

Post #1599 in the sota.io EU Compliance Series — EU AI Act Regulatory Sandbox 2026 #3/5


Getting into an EU AI Act regulatory sandbox is the first hurdle. The second — and the one that determines whether the sandbox actually helps you — is knowing what to do once you are inside.

Many developers enter the sandbox with a general sense that they need to "test their AI system" under regulatory supervision. But the sandbox period is time-limited, your supervising authority has specific expectations, and the evidence you generate during this period is the foundation for your eventual conformity assessment under Art.43. If you leave the sandbox without the right documentation, you face the same compliance gap you started with, plus a depleted runway.

This is the third post in our five-part series on EU AI Act regulatory sandboxes. Part one covered what Art.57 sandboxes are and their protections. Part two covered application strategy and development plan structure. Here, we cover the testing phase itself: protocol design, the GDPR Art.6/Art.9 framework for sandbox data processing, Art.9(7) testing requirements, and how to turn experimental results into Annex IV documentation.

What Testing Authorization Actually Means

Article 57 authorizes member states to allow providers to develop, train, test, and validate AI systems within the sandbox under direct regulatory supervision. But "authorization" here does not mean unconditional permission to do anything. The sandbox authorization is scoped to:

Your specific AI system as described in the application
Your specific intended purpose (Annex III classification)
Your development plan milestones as approved by the NCA
The sandbox period — typically six to twelve months

Article 58 and your sandbox agreement set the participation conditions. When you sign the agreement with the NCA, you commit to:

Providing transparent access to your system, documentation, and testing logs
Notifying the authority immediately if testing reveals serious risks to health, safety, or fundamental rights
Stopping or modifying testing if the authority requests it
Complying with all conditions specified in the sandbox agreement

The authorization also has explicit limits: prohibited practices under Art.5 remain prohibited regardless of sandbox status. No testing within the sandbox can involve systems that perform social scoring, exploit vulnerabilities in persons for manipulation, or conduct real-time remote biometric identification in public spaces outside the narrow law enforcement exception. These are not compliance risks to manage around — they are absolute limits.

GDPR Legal Basis for Sandbox Data Processing

Art.57 authorizes supervised development and testing of AI systems, but it does not suspend data protection law: GDPR applies in full throughout the sandbox period. Getting the legal basis right for personal data used in sandbox testing is one of the most practically important steps, and one that most application guides omit.

For personal data processed during sandbox AI testing, you need a valid GDPR Art.6 legal basis. The practical options are:

GDPR Art.6(1)(a) (consent): Consent from each data subject for use of their data in AI testing. This is the cleanest basis but logistically demanding, because the consent must be specific and granular, for example: "for use in developing and testing an AI system for [specific purpose] under the supervision of [NCA] during [sandbox period]."

GDPR Art.6(1)(e) (public task / official authority): If your AI system is being developed under direct NCA supervision for a purpose that serves a public interest (public sector AI, healthcare AI, infrastructure AI), this basis may apply. The sandbox itself provides the supervisory frame that supports this basis, but Art.6(3) still requires the task to be laid down in Union or Member State law.

GDPR Art.6(1)(f) (legitimate interests): For non-sensitive data processed in ways that data subjects would reasonably expect. Requires a Legitimate Interests Assessment (LIA) that documents the balancing test. The sandbox period and NCA oversight strengthen the proportionality argument, but this basis on its own never covers special category data, which also needs an Art.9(2) condition.

GDPR Art.6(4) (compatible further processing): If you have existing data collected for a related purpose, Art.6(4) allows assessment of whether using it for AI testing is "compatible" with the original purpose. Key factors: the nature of the link between purposes, context, nature of the data, possible consequences, and existence of safeguards. The sandbox provides a safeguard context that supports this analysis, but it does not replace the analysis.

If your sandbox testing involves special category data — biometric data, health data, genetic data, racial/ethnic origin, political opinions, religious beliefs, or trade union membership — Art.6 alone is insufficient. You need an additional basis under GDPR Art.9(2):

Art.9(2)(a): Explicit consent for the specific AI testing purpose
Art.9(2)(i): Public interest in the area of public health (for healthcare AI)
Art.9(2)(j): Scientific or historical research purposes (with Art.89 safeguards)

Sandbox participation does not itself create an Art.9(2) legal basis. Each dataset involving special category data requires its own documented legal basis before testing begins.

Purpose limitation under Art.5(1)(b) requires that personal data be collected for "specified, explicit and legitimate purposes" and not further processed "in a manner that is incompatible." Using data collected for, say, a customer service function as AI training data requires the Art.6(4) compatibility assessment — even in a regulatory sandbox.

GDPR Chapter V transfer restrictions (Art.44-49) also apply without modification. If your training pipeline routes personal data through systems hosted by US-parent cloud providers, those providers are subject to US CLOUD Act warrants regardless of their EU data center locations. This creates a dual-jurisdiction problem:

Your sandbox supervisor (the NCA) expects data processed under their oversight to remain in EU-accessible jurisdiction
A US federal warrant can compel your cloud provider to hand over sandbox training data outside EU legal process, and outside your supervisor's knowledge

EU-sovereign infrastructure — cloud providers without US-parent structures — eliminates this exposure. The NCA's supervisory access remains the only access path.

For each dataset used in sandbox testing, document:

Legal basis: Which Art.6(1) provision applies, and why (with LIA if using (1)(f))
Art.9(2) basis if applicable: Which special category provision covers the processing
Purpose compatibility statement: If data was collected for a different purpose, the Art.6(4) compatibility assessment
Data protection measures: Pseudonymization, access controls, encryption, deletion schedule
Deletion/anonymization plan: When and how data is removed at sandbox conclusion

This documentation becomes part of your Annex IV technical documentation package (Section 2: Design specifications and training methodology, and Section 3: Monitoring, functioning, and control of the system).
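One way to keep this per-dataset documentation consistent is to hold it in a structured record that flags missing items before testing starts. The following is a minimal sketch, not a prescribed format: the `DatasetRecord` class, its field names, and the example dataset are hypothetical and are not defined by the GDPR or the AI Act.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DatasetRecord:
    """Hypothetical per-dataset documentation record for sandbox testing."""
    name: str
    art6_basis: str                                  # e.g. "6(1)(a)", "6(1)(e)", "6(1)(f)"
    special_category: bool = False
    art9_basis: Optional[str] = None                 # e.g. "9(2)(a)", "9(2)(j)"
    lia_reference: Optional[str] = None              # pointer to the LIA document
    original_purpose: Optional[str] = None           # set if data was collected for another purpose
    compatibility_assessment: Optional[str] = None   # pointer to the Art.6(4) assessment
    protection_measures: list[str] = field(default_factory=list)
    deletion_date: Optional[date] = None

    def gaps(self) -> list[str]:
        """Return the documentation items still missing for this dataset."""
        missing = []
        if self.art6_basis == "6(1)(f)" and not self.lia_reference:
            missing.append("LIA required for Art.6(1)(f)")
        if self.special_category and not self.art9_basis:
            missing.append("Art.9(2) basis required for special category data")
        if self.original_purpose and not self.compatibility_assessment:
            missing.append("Art.6(4) compatibility assessment required")
        if not self.protection_measures:
            missing.append("data protection measures not listed")
        if self.deletion_date is None:
            missing.append("deletion/anonymization date not set")
        return missing


# Invented example: repurposed customer service data with nothing documented yet.
record = DatasetRecord(
    name="support_tickets_2024",
    art6_basis="6(1)(f)",
    special_category=True,
    original_purpose="customer service",
)
for gap in record.gaps():
    print(gap)
```

Running a check like this for every dataset before any personal data is loaded gives you a simple gate: testing starts only when `gaps()` returns an empty list.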

Art.9(7): Testing Requirements for High-Risk AI Systems

Article 9(7) of the EU AI Act establishes that testing of high-risk AI systems must be performed prior to placing on the market or putting into service. The regulatory sandbox creates the supervised environment in which this testing occurs.

The testing obligation under Art.9(7) requires that testing:

Is carried out against the intended purpose and the specifications described in the technical documentation
Covers the population groups relevant to the system's intended use
Includes test scenarios that reflect real-world operating conditions, including edge cases and reasonably foreseeable misuse
Uses appropriate metrics to assess the performance of the AI system in relation to its intended purpose

What "Appropriate Metrics" Means in Practice

The NCA will ask for specific quantitative performance metrics. The metrics need to be:

Purpose-aligned: If your system makes credit risk assessments, accuracy against a holdout set is necessary but not sufficient. The authority wants to see performance segmented by demographic group (for fairness assessment), performance under distribution shift (training versus deployment data differences), and calibration (how well your confidence scores reflect actual probability).

Threshold-documented: Each metric needs a defined acceptance threshold that you established before testing began — not after seeing results. Post-hoc threshold setting is a significant red flag for NCA reviewers. Your development plan should have included draft metric thresholds; the sandbox testing validates or revises them.

Failure-mode catalogued: When your system produces incorrect outputs, which failure modes appear? How do they distribute across input types? The sandbox is the place to find and document failure modes under supervision rather than discovering them in production.
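The segmented and calibration metrics described above can be computed in a few lines. The following is a minimal sketch using only the Python standard library; the function names and the toy labels, predictions, probabilities, and group assignments are illustrative assumptions, not real sandbox data.

```python
from collections import defaultdict


def fpr_by_group(y_true, y_pred, groups):
    """False positive rate per group; labels and predictions are 0 or 1."""
    false_pos = defaultdict(int)  # predicted 1 where the true label is 0
    negatives = defaultdict(int)  # all cases where the true label is 0
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 0:
            negatives[group] += 1
            false_pos[group] += pred
    return {g: false_pos[g] / negatives[g] for g in negatives}


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Toy data: six cases per group, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0]
groups = ["A"] * 6 + ["B"] * 6
y_prob = [0.1, 0.6, 0.2, 0.1, 0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.9, 0.4]

rates = fpr_by_group(y_true, y_pred, groups)
disparity = max(rates.values()) - min(rates.values())
print("FPR by group:", rates)
print("FPR disparity:", disparity)
print("Brier score:", brier_score(y_true, y_prob))
```

On this toy data the false positive rate is 0.25 for group A and 0.5 for group B. A gap of that size is exactly the kind of result that overall accuracy on a holdout set would hide.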

Testing Against Reasonably Foreseeable Misuse

Article 9's misuse requirement is often under-executed. For sandbox purposes, document:

Adversarial inputs: What happens when users deliberately manipulate input data to shift the system's output? (Relevant for credit scoring, document authentication, CV screening)
Out-of-distribution inputs: What does the system do when it receives inputs it was not trained on? Does it fail gracefully, or does it produce confident but wrong outputs?
Boundary conditions: At what input values does the system's classification cross thresholds? How sensitive is it to small perturbations near those boundaries?

The NCA sandbox supervisor may specifically request adversarial testing results. Not having them is a gap that will slow your sandbox exit and subsequent conformity assessment.
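Boundary sensitivity in particular lends itself to an automated check. The sketch below perturbs each input feature by a small random amount and measures how often the classification flips. It is a simplified illustration: `score` is a hypothetical stand-in for the model under test, and the feature names, the 1% noise level, and the 0.5 decision threshold are assumptions.

```python
import random


def score(income, debt):
    """Stand-in scoring function; a real test would call the model under test."""
    return max(0.0, min(1.0, 0.5 + (income - 2 * debt) / 100_000))


def flip_rate(features, threshold=0.5, rel_noise=0.01, trials=200, seed=0):
    """Share of small random perturbations that change the classification."""
    rng = random.Random(seed)  # fixed seed so the test run is reproducible
    base = score(**features) >= threshold
    flips = 0
    for _ in range(trials):
        perturbed = {
            name: value * (1 + rng.uniform(-rel_noise, rel_noise))
            for name, value in features.items()
        }
        if (score(**perturbed) >= threshold) != base:
            flips += 1
    return flips / trials


# One case far from the decision boundary, one sitting exactly on it.
print(flip_rate({"income": 80_000, "debt": 10_000}))
print(flip_rate({"income": 40_000, "debt": 20_000}))
```

A flip rate near zero indicates a stable decision. A high flip rate marks an input region where small data entry differences change the outcome, which is worth recording as a failure mode.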

Designing Your Sandbox Testing Protocol

A sandbox testing protocol is not a software QA plan. It is a regulatory document that demonstrates you understand the risks your AI system poses and have systematically assessed those risks. Structure it in three phases.

Phase 1: Baseline Testing

Before any optimization or fine-tuning, establish baseline performance:

Train your model on the approved training dataset
Evaluate against a held-out validation set
Record all performance metrics in the metric framework established in your development plan
Identify failure modes at baseline

This gives the NCA a clear starting point. It also protects you: if later testing shows performance degraded (which sometimes happens during optimization), you have documented proof of the baseline.
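Comparing later runs against the recorded baseline can be automated. The following is a minimal sketch under stated assumptions: the metric names and values are invented, and `higher_is_better` is a hypothetical mapping you would define alongside your metric framework.

```python
def regressions(baseline, current, higher_is_better, tolerance=0.0):
    """Return the metrics that are worse than baseline by more than `tolerance`."""
    worse = {}
    for name, base in baseline.items():
        delta = current[name] - base
        if not higher_is_better[name]:
            delta = -delta  # for error-style metrics, an increase is a regression
        if delta < -tolerance:
            worse[name] = {"baseline": base, "current": current[name]}
    return worse


# Invented values: accuracy improved after optimization, calibration got worse.
baseline = {"accuracy": 0.91, "brier_score": 0.08}
current = {"accuracy": 0.93, "brier_score": 0.11}
higher_is_better = {"accuracy": True, "brier_score": False}

worse = regressions(baseline, current, higher_is_better)
print(worse)
```

In this example only the Brier score is flagged, which illustrates why the baseline should record every metric and not just the headline one.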

Phase 2: Systematic Risk Mitigation Testing

For each risk identified in your Art.9 risk management system, design a test scenario:

Risk Identified	Test Design	Acceptance Threshold	Mitigation if Failed
Discriminatory output against protected characteristic X	Evaluate Equalized Odds across demographic groups A/B	Max 5% disparity in false positive rate	Rebalancing training data; post-processing calibration
High-confidence incorrect outputs	Calibration error on holdout	Brier score < 0.10	Temperature scaling; ensemble methods
Misuse via input manipulation	Red-team adversarial inputs (N=200 scenarios)	<15% adversarial success rate	Input validation layer

This table structure is the kind of documentation NCA reviewers find credible. It shows:

You identified specific risks (not vague "the system could be wrong" statements)
You designed tests that actually probe those risks
You set thresholds in advance
You have mitigation strategies ready
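To show that thresholds were set in advance, one option is to freeze them in a machine-readable form and record a digest before testing begins. The sketch below assumes this approach; it is not something the AI Act prescribes. The threshold values mirror the illustrative table above, and the metric keys, the `fingerprint` and `evaluate` helpers, and the measured values are hypothetical.

```python
import hashlib
import json
import operator

OPS = {"<": operator.lt, "<=": operator.le}

# Illustrative thresholds taken from the example table; not regulatory values.
THRESHOLDS = {
    "fpr_disparity": {"op": "<=", "limit": 0.05},
    "brier_score": {"op": "<", "limit": 0.10},
    "adversarial_success_rate": {"op": "<", "limit": 0.15},
}


def fingerprint(thresholds):
    """Stable SHA-256 digest of the threshold set, to record before testing starts."""
    canonical = json.dumps(thresholds, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def evaluate(measured, thresholds):
    """Compare measured values to thresholds; a missing measurement counts as a fail."""
    results = {}
    for name, rule in thresholds.items():
        value = measured.get(name)
        passed = value is not None and OPS[rule["op"]](value, rule["limit"])
        results[name] = {"value": value, "limit": rule["limit"], "passed": passed}
    return results


registered = fingerprint(THRESHOLDS)  # store this digest with the development plan

# Invented measurements from a later test run.
measured = {"fpr_disparity": 0.03, "brier_score": 0.12, "adversarial_success_rate": 0.10}
results = evaluate(measured, THRESHOLDS)
for name, outcome in results.items():
    print(name, outcome)
```

If a threshold is later revised, the digest changes, so the revision has to be made openly and justified to the supervisor instead of happening silently after results are known.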

Phase 3: Real-World Conditions Validation

Art.9(7) requires testing under real-world conditions. In the sandbox context, this means:

Testing with data drawn from the actual deployment population, not just the training distribution
Simulating operational edge cases (data entry errors, missing fields, unusual input formats)
Evaluating system behavior across seasonal or temporal variation if your training data has such patterns
If possible: limited deployment to real users under sandbox supervision, with full Art.12 logging active

The sandbox agreement may authorize limited real-user testing with explicit consent and active NCA oversight. This generates the most valuable conformity assessment evidence — real interaction data, captured under supervision, with documented user impact.

Logging Infrastructure: What the Sandbox Requires

Article 12 requires high-risk AI systems to have logging capabilities that allow post-facto reconstruction of the system's functioning. During sandbox testing, Art.12 logging is not optional — it is the supervisory interface between you and the NCA.

Your sandbox logging infrastructure must capture:

Input records: What inputs the system received, with timestamps, and in a format that allows the NCA to replay the test scenario

Output records: What the system output for each input, including confidence scores and intermediate reasoning steps where applicable

Model version tracking: Which model checkpoint was running at each test timestamp (critical when you are iterating through training runs)

Data access logs: Which dataset partitions were loaded, processed, and deleted at each stage (feeds into GDPR Art.6/Art.9 documentation)

Human intervention logs: Where human reviewers overrode or corrected system outputs during supervised testing (required for Art.14 human oversight documentation)
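The record types above can be captured in a single structured log entry. The following is a minimal sketch, assuming a JSON-style record and a hash chain for tamper evidence. The hash chain is an added design choice for illustration, not an Art.12 requirement, and the field names, checkpoint labels, and pseudonymized references are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder hash that starts the chain


def make_log_entry(prev_hash, model_version, inputs, output, confidence,
                   human_override=None):
    """Build one log entry that is linked to the previous entry by its hash."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,    # which checkpoint produced the output
        "inputs": inputs,                  # pseudonymized inputs, replayable
        "output": output,
        "confidence": confidence,
        "human_override": human_override,  # reviewer decision, if any
        "prev_hash": prev_hash,
    }
    payload = json.dumps(entry, sort_keys=True).encode("utf-8")
    entry["entry_hash"] = hashlib.sha256(payload).hexdigest()
    return entry


def verify_chain(entries):
    """Check that no entry was altered, removed, or reordered."""
    prev = GENESIS
    for entry in entries:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        if entry["prev_hash"] != prev:
            return False
        if hashlib.sha256(payload).hexdigest() != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True


log = []
first = make_log_entry(GENESIS, "ckpt-0042", {"applicant_ref": "pseudonym-17"},
                       "approve", 0.87)
log.append(first)
log.append(make_log_entry(first["entry_hash"], "ckpt-0042",
                          {"applicant_ref": "pseudonym-18"}, "reject", 0.64,
                          human_override="approve"))
print(verify_chain(log))
```

Chaining entries means a later edit to any record breaks verification for the whole log, which gives the supervisor a simple integrity check on the evidence.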

The Data Sovereignty Problem in Sandbox Logging

Sandbox logs are, by definition, the primary documentation the NCA uses to supervise your development. If those logs are stored in CLOUD Act-exposed infrastructure, the NCA's exclusive supervisory access to your development process is compromised in principle — even if, in practice, no warrant is ever served.

EU-sovereign logging infrastructure means:

Logs stored exclusively in EU-jurisdictional systems (no US-parent cloud provider)
NCA-accessible via EU legal process only
No third-party analytics services that export log data outside the EU

This is not a compliance requirement that the EU AI Act spells out explicitly — but it is a practical prerequisite for a functioning supervisory relationship.

Turning Sandbox Evidence into Annex IV Documentation

The sandbox exit deliverable is a sandbox report that the NCA provides to you after the supervised period. Under Art.57, this report summarizes the testing conducted, the supervisory authority's observations, and any remaining compliance steps before market placement.

But the sandbox report is only part of your conformity assessment evidence package. During testing, you should be continuously building the Annex IV technical documentation that Art.43 requires.

Annex IV Sections Built During Sandbox Testing

Section 3: Information about the monitoring, functioning and control of the AI system

During sandbox testing, every test run contributes to this section. Your logging infrastructure directly generates:

System monitoring approach
Output logging methodology
Human oversight intervention records
Performance metric tracking over time

Section 4: Description of the changes made to the system and its performance

Every training run, parameter adjustment, or architecture change should be logged as a version control entry. The sandbox is the place to establish this discipline — the notified body assessing your Annex IV package will review the complete development history.

Section 5: Assessment of human oversight measures

Documentation for the Art.14 human oversight requirements is generated by your sandbox testing process: every human reviewer intervention, override, and escalation becomes evidence of your oversight mechanism.

Section 6: Specification of the input data

The dataset documentation you produce for each dataset used in sandbox testing (GDPR Art.6 basis, Art.9 basis if applicable, compatibility assessment) feeds directly into Annex IV Section 6 — dataset composition, representativeness, preprocessing steps.

Sections 7 and 8: Testing done and examination reports

This is the primary output of your sandbox testing protocol. Every Phase 1, Phase 2, and Phase 3 test result, with metrics, thresholds, pass/fail outcomes, and failure mode analysis, becomes your Section 7-8 evidence base.
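If test results are stored as structured records, the evidence summary can be generated directly from them. The sketch below assumes a simple list of result dictionaries; the record layout, the test names, and the helper functions are hypothetical.

```python
from collections import Counter


def summarize(results):
    """Count passed and failed tests per protocol phase."""
    per_phase = {}
    for result in results:
        counts = per_phase.setdefault(result["phase"], Counter())
        counts["passed" if result["passed"] else "failed"] += 1
    return {phase: dict(counts) for phase, counts in sorted(per_phase.items())}


def open_items(results):
    """Names of failed tests, which need a documented mitigation or follow-up."""
    return [result["test"] for result in results if not result["passed"]]


# Invented results covering the three protocol phases.
results = [
    {"phase": 1, "test": "baseline_accuracy", "passed": True},
    {"phase": 2, "test": "fpr_disparity", "passed": True},
    {"phase": 2, "test": "calibration_brier", "passed": False},
    {"phase": 3, "test": "missing_field_handling", "passed": True},
]

print(summarize(results))
print(open_items(results))
```

Keeping failed tests visible in the summary, with their mitigations, supports the failure mode documentation a reviewer will look for.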

What the Notified Body Expects After the Sandbox

When you exit the sandbox and submit your system for conformity assessment under Art.43, a notified body will review your complete Annex IV package. They will not treat sandbox participation as a compliance shortcut — but they will treat it as evidence of regulatory engagement that gives credibility to your documentation.

Specifically, notified bodies look for:

Completeness of testing: Were the right risks tested? Gaps in Art.9 risk coverage — especially for the specific Annex III category your system falls under — are the most common reason for conformity assessment delay.

Threshold rationale: Why did you choose the metric thresholds you chose? "We set 5% disparity tolerance because our risk management analysis concluded that greater disparity would constitute unacceptable discrimination risk in our use case" is the right answer. "We set 5% because we passed it" is not.

Failure mode documentation: Have you identified and characterized all significant failure modes? Undocumented failure modes are more alarming to a notified body than documented ones with mitigations.

Human oversight evidence: Can you demonstrate, with specific log records from sandbox testing, that the human oversight mechanisms you claim in your technical documentation actually work in practice?

Infrastructure continuity: Is the sandbox infrastructure you tested on the same infrastructure you plan to deploy on? A notified body assessment of documentation for one infrastructure environment does not transfer to a different environment.

Preparing Your Sandbox Exit Package

Three documents should be finalized before your sandbox period ends:

1. Sandbox Test Report (from NCA): Request this from your supervisory authority approximately eight weeks before your sandbox expiry. The authority needs time to draft it, and any back-and-forth on factual accuracy should happen before the deadline.

2. Annex IV Technical Documentation Package (self-generated): The complete 8-section package incorporating all sandbox testing evidence. This is your primary conformity assessment input document.

3. Residual Compliance Gap Analysis: An honest assessment of what you still need to complete between sandbox exit and market placement. Art.17 quality management system? Final Art.72 post-market monitoring setup? Infrastructure changes? Define these explicitly — the notified body will find them anyway, and identifying them proactively demonstrates maturity.

In the next post, we cover the cross-border dimension: Art.58 NCA coordination for sandboxes, what happens when your AI system is intended for multiple EU markets, and how to manage the supervisory relationship when multiple national authorities are involved.

Infrastructure Checklist for Sandbox Testing

Before sandbox testing begins, verify:

Art.12 logging infrastructure deployed and NCA-accessible
Model version control system records all training runs with timestamps and parameter changes
GDPR Art.6 legal basis documented for each personal data source used in testing
Art.9(2) basis documented for any special category data (biometric, health, genetic)
Metric tracking dashboard allows real-time monitoring by sandbox supervisor
Data deletion/anonymization protocol documented and tested before personal data loaded
Infrastructure hosting statement prepared: EU-sovereign or CLOUD Act-exposed, with implications documented for supervisor
Failure mode escalation procedure: if testing reveals serious risk, what is the notification protocol to the NCA?
Human reviewer log captures all intervention events with timestamps and rationale
Sandbox agreement reviewed for specific testing conditions or restrictions not in the standard framework

EU AI Act regulatory sandbox series: Art.57 Developer Guide #1/5 — Application & Development Plan #2/5 — Testing Protocol #3/5 — Cross-Border Sandboxes #4/5 — Compliance Transition Checklist #5/5
