2026-06-08·5 min read·sota.io Team

EU AI Act Art.10 Training Data Governance: Why Your Dataset Storage Location Is a Compliance Decision

Post #2 in the sota.io EU AI Act Infrastructure Compliance Series


With 55 days until the EU AI Act's August 2, 2026 enforcement deadline, high-risk AI providers have spent months building data governance programs: documenting training dataset origins, running bias tests, implementing quality criteria, and mapping personal data flows. Most of this work has focused on the process — the documentation, the tests, the records.

Few have addressed the infrastructure question underneath it all: where do your training datasets physically live, and what legal reach does that jurisdiction create?

This post covers Article 10 of Regulation (EU) 2024/1689 and why dataset storage jurisdiction is a compliance variable — not a deployment detail — for every high-risk AI system subject to the August deadline.


What Art.10 Actually Requires

Article 10 ("Data and data governance") is the most operationally demanding article in the EU AI Act for high-risk AI providers. It establishes comprehensive requirements for the training, validation, and test datasets used in high-risk AI systems, including those listed in Annex III.

The obligations cover four functional areas:

1. Data Governance Practices

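Art.10(2)(a)–(h) requires that training, validation, and testing datasets be subject to appropriate data governance and management practices. These cover the relevant design choices; data collection processes and the origin of the data; data-preparation operations such as annotation, labelling, and cleaning; the assumptions the data are meant to represent; an assessment of the availability, quantity, and suitability of the datasets; examination for possible biases and measures to mitigate them; and identification of data gaps or shortcomings.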

This is documentation-heavy by design. The national competent authority (NCA) auditor reviewing a conformity assessment dossier will check that your training data governance trail is complete, unbroken, and, critically, unaltered.

2. Examination for Bias

Art.10(2)(f) requires examining datasets for possible biases that are likely to affect health and safety, negatively affect fundamental rights, or lead to discrimination prohibited under Union law. Art.10(2)(g) then requires appropriate measures to detect, prevent, and mitigate the biases identified.
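
As a rough illustration of what a recorded bias examination can look like, here is a minimal sketch of a demographic parity check. Art.10 does not prescribe a metric or a threshold, so the choice of demographic parity, the function name, and any pass/fail cut-off you apply to the result are assumptions for illustration only.

```python
from collections import defaultdict


def demographic_parity_difference(predictions, groups):
    """Return (gap, rates).

    rates maps each group to its share of positive predictions.
    gap is the difference between the highest and lowest group rate.
    """
    if len(predictions) != len(groups):
        raise ValueError("predictions and groups must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")

    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        if pred:
            positives[group] += 1

    rates = {group: positives[group] / totals[group] for group in totals}
    gap = max(rates.values()) - min(rates.values())
    return gap, rates
```

Storing the returned gap and per-group rates next to the dataset version gives the bias examination a concrete, reviewable artifact rather than a bare "passed" flag.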

3. Special Category Personal Data

Art.10(5) addresses the sensitive case directly:

To the extent strictly necessary for the purposes of ensuring bias monitoring, detection and correction in relation to the high-risk AI systems, the providers of such systems may process special categories of personal data referred to in Article 9(1) of Regulation (EU) 2016/679...

This is an explicit data processing permission — but it comes with a compliance precondition. The processing must be strictly necessary, subject to suitable safeguards, and documented to a standard that an NCA can audit.

4. Data Traceability

Art.10(2)(a)–(b) effectively requires that training datasets be traceable back to their origin. For a credit-scoring model (Annex III, Pt.5b), this means being able to show that the dataset used for training v1.3 came from source X, was processed by pipeline Y, and was validated on date Z. The record must be reproducible on demand.
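
The "source X, pipeline Y, date Z" lookup described above can be sketched as a small registry keyed by model version. The class and field names here are hypothetical, and a real setup would persist these records rather than hold them in memory.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class ProvenanceRecord:
    """One traceability entry: which data trained which model version."""
    model_version: str
    dataset_name: str
    source: str
    pipeline: str
    validated_on: date


class ProvenanceRegistry:
    """In-memory registry; entries are write-once per model version."""

    def __init__(self):
        self._records = {}

    def register(self, record: ProvenanceRecord) -> None:
        if record.model_version in self._records:
            raise ValueError(f"{record.model_version} is already registered")
        self._records[record.model_version] = record

    def trace(self, model_version: str) -> dict:
        """Return the provenance trail for a model version as plain data."""
        record = self._records[model_version]
        trail = asdict(record)
        trail["validated_on"] = record.validated_on.isoformat()
        return trail
```

Refusing to overwrite an existing entry is a deliberate choice in this sketch: a traceability record that can be silently replaced is hard to defend in an audit.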


The Infrastructure Dimension: Where Training Data Lives

Here's the compliance problem most Art.10 implementation guides don't address.

Your training datasets, validation sets, and test sets, along with the preprocessing pipelines, bias detection scripts, and governance logs generated to satisfy Art.10, typically live in cloud object storage such as AWS S3, Azure Blob Storage, or Google Cloud Storage.

These are all operated by US-headquartered providers. All are subject to the US CLOUD Act (18 U.S.C. § 2713).

The CLOUD Act requires US cloud providers to comply with valid US legal process for data in their possession, custody, or control, regardless of where that data is stored, including in EU-based data centers. The grounds on which a provider can challenge such an order are narrow, and the provider cannot simply refuse because the data is stored in Germany or Ireland.

For Art.10 compliance, this creates three specific risk scenarios.


Risk Scenario 1: Training Dataset Integrity

Your Art.10 compliance rests on being able to demonstrate that your training datasets are exactly what you documented them to be. The Art.10 trail shows:

Dataset: credit-scoring-training-v2.1.parquet
Origin: financial_transactions_2022-2024
Processing: anonymisation pipeline v3.2, executed 2025-11-14
Bias testing: demographic parity test, passed 2025-11-20
Storage: s3://my-bucket/training-data/eu-west-1/

A US Department of Justice CLOUD Act request to AWS could compel disclosure of the objects in that bucket and of records related to the account.

The request might target the dataset as part of an unrelated investigation — a counterparty, a data supplier, or even a competitor. The practical effect: your audit-ready Art.10 training data now has an undocumented access event in its chain of custody.

Under Art.10(2), training data governance practices must be documented. A CLOUD Act access event that isn't captured in your governance log is a gap. An NCA auditor asking "show me every access to this training dataset since system deployment" expects a complete answer.

What This Looks Like in an NCA Audit

The NCA market surveillance process under Art.74 allows authorities to request the complete technical documentation package, which under Art.11 and Art.10 includes your training data governance records.

If your training data lives in AWS S3 and has been subject to a CLOUD Act request you were not notified of (US law allows such requests to be accompanied by non-disclosure orders that bar the provider from telling you), you cannot guarantee that your documented chain of custody for the dataset is complete. This is not a theoretical concern. It is a chain-of-custody problem that auditors can be expected to probe.


Risk Scenario 2: Special Category Data Processing Under Art.10(5)

Art.10(5) permits processing special categories of personal data (health data, racial or ethnic origin, biometric data) under strict conditions — primarily for bias detection and correction. This processing is generally concentrated in your training data preprocessing pipelines.

If those pipelines run on CLOUD Act-exposed infrastructure and process special category data:

  1. GDPR Art.44 restricts transfers of personal data to third countries without adequate protection
  2. A CLOUD Act request that compels disclosure of special category data to US authorities represents a transfer without the data subject's consent and without GDPR Art.46 appropriate safeguards
  3. Your Art.10(5) justification was "strictly necessary for bias correction" — not "responding to US law enforcement requests"

The scope of the Art.10(5) permission does not cover compelled disclosure to a third-country authority. If your bias detection pipeline ran on CLOUD Act-exposed infrastructure and that infrastructure was targeted, your Art.10(5) processing justification is structurally incomplete.


Risk Scenario 3: Dataset Availability for NCA Inspection

Article 74(12) gives market surveillance authorities full access to the documentation and to the training, validation, and testing datasets used to develop a high-risk AI system, where this is necessary to fulfil their tasks. Article 74(13) adds access to the source code upon a reasoned request, under stricter conditions.

This creates an EU access obligation: the NCA of the member state where your system is deployed must be able to access your training data. The access obligation exists in EU law under EU institutional authority.

But if your datasets live in AWS S3 and are simultaneously subject to a CLOUD Act confidentiality order from a US court, you face a direct conflict between an EU disclosure obligation and a US secrecy obligation.

European data protection authorities have analysed analogous conflicts in the context of international data transfers. Following the CJEU's Schrems II judgment (Case C-311/18), EDPB guidance on transfers has treated US surveillance law as an obstacle to protection essentially equivalent to the GDPR. The same reasoning is relevant when the data in question is your Art.10 training dataset.


The GDPR Layer

Art.10(3) states that training datasets "shall be relevant, sufficiently representative, and to the best extent possible, free of errors and complete in view of the intended purpose." This is a substantive data quality standard.

Art.10(3) also interacts with GDPR in a specific way. For high-risk AI systems processing personal data in training (which is the case for most systems in Annex III Pt.1, 3, 5, and 6), the training dataset is a GDPR data processing activity.

GDPR Art.44 (transfers to third countries) is triggered when personal data is accessed by an entity in a third country — including through legal compulsion. A CLOUD Act disclosure of your training dataset to US authorities is a transfer to a third country, executed outside the GDPR transfer framework (standard contractual clauses, adequacy decisions, binding corporate rules).

The EU-US Data Privacy Framework (adequacy decision adopted in July 2023) does not protect against CLOUD Act requests that fall outside the DPF Principles' national security exceptions. The DPF provides no mechanism to challenge CLOUD Act production orders targeting training datasets.


What Art.10-Compliant Training Data Infrastructure Looks Like

A training data governance setup that satisfies Art.10 while managing CLOUD Act exposure has four characteristics:

1. EU-Sovereign Object Storage

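Store training datasets, preprocessing outputs, and bias testing results in object storage operated by a provider not subject to US parent-company jurisdiction, such as Hetzner, Scaleway, or OVH.

A provider with no US parent company and no US presence is generally outside the reach of 18 U.S.C. § 2713, because a US court needs jurisdiction over the provider before it can enforce an order against it. Check each provider's corporate structure before relying on this: a US subsidiary or US operations can bring the exposure back.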

2. Immutable Audit Logs for Art.10 Compliance Records

Your Art.10 governance records — what data was used, when, by whom, for what purpose — should be stored in append-only, tamper-evident format. Object storage with Object Lock (WORM) enabled, combined with cryptographic hash verification, gives you a defensible chain of custody.

For the NCA auditor asking "has this dataset been modified since training?" the answer should be a cryptographic proof, not a cloud provider's assurance.

import hashlib
import json
from datetime import datetime, timezone

def record_dataset_governance(dataset_path: str, metadata: dict) -> dict:
    """Record an Art.10 dataset governance event with a tamper-evident hash."""
    # Hash in 1 MiB chunks so large training datasets are never loaded whole.
    digest = hashlib.sha256()
    with open(dataset_path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            digest.update(chunk)

    record = {
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated).
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataset_path": dataset_path,
        "sha256": digest.hexdigest(),
        "art10_metadata": metadata,  # must be JSON-serialisable
        "operator": "system",
        "jurisdiction": "EU-sovereign"
    }

    # Hash the record itself (before this field is added) so later edits are detectable.
    record_bytes = json.dumps(record, sort_keys=True).encode()
    record["record_hash"] = hashlib.sha256(record_bytes).hexdigest()

    return record
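
Recording a hash is only half of the chain of custody. At audit time you also need to re-check it. The sketch below assumes the record layout produced above (a "sha256" field for the dataset and a "record_hash" computed over the sorted-key JSON of every other field). The function names are hypothetical.

```python
import hashlib
import json


def sha256_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large datasets are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_governance_record(record: dict, dataset_path: str) -> dict:
    """Re-check both hashes in a governance record.

    record_intact:  the record fields still match the stored record_hash.
    dataset_intact: the file on disk still matches the stored sha256.
    """
    body = {k: v for k, v in record.items() if k != "record_hash"}
    expected = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    return {
        "record_intact": expected == record.get("record_hash"),
        "dataset_intact": sha256_file(dataset_path) == record.get("sha256"),
    }
```

Note the limit of this check: a self-hash only detects accidental or careless edits. Anyone who can rewrite the record can also recompute its hash, which is why the records themselves belong in WORM storage.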

3. Access Control Audit Trail

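Art.10(2) data governance practices should include a complete access log for training datasets: who accessed them, when, and for what purpose. Like the governance records above, this log should be append-only, tamper-evident, and stored separately from the datasets it describes.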

If your access log lives in AWS CloudTrail or Azure Monitor, it has the same CLOUD Act exposure problem as the datasets themselves.
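
One common way to make an access log tamper-evident is to chain entries by hash, so that each entry commits to the one before it. The following is a minimal in-memory sketch under that assumption. The field names and the all-zero genesis value are illustrative choices, not a standard.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_access_event(log: list, actor: str, dataset: str,
                        purpose: str, timestamp: str) -> dict:
    """Append an access event that commits to the previous entry's hash."""
    body = {
        "actor": actor,
        "dataset": dataset,
        "purpose": purpose,
        "timestamp": timestamp,
        "prev_hash": log[-1]["entry_hash"] if log else GENESIS,
    }
    entry = dict(body, entry_hash=_entry_hash(body))
    log.append(entry)
    return entry


def verify_access_log(log: list) -> bool:
    """Return True if no entry was edited, reordered, or removed mid-chain."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body.get("prev_hash") != prev_hash:
            return False
        if _entry_hash(body) != entry.get("entry_hash"):
            return False
        prev_hash = entry["entry_hash"]
    return True
```

A chain like this shows that logged entries were not edited or removed from the middle. It does not detect truncation of the most recent entries, and it cannot record an access that the storage provider never reported, which is the gap this section is about.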

4. Data Processing Agreements That Address Third-Country Access

Your DPAs with cloud providers should explicitly address third-country access requests, including CLOUD Act scenarios.

Standard EU cloud provider DPAs typically include GDPR Art.28 processor commitments. They generally do not explicitly address CLOUD Act scenarios. For training datasets containing personal data under Art.10(3)–(5), your DPA should specifically address this.


Building the Art.10 Infrastructure Compliance Stack

A minimal Art.10-compliant training data infrastructure for an August 2026 deadline looks like this:

Storage layer:

Training datasets     → EU-sovereign object storage (Hetzner/Scaleway/OVH)
Preprocessing outputs → Same, with Object Lock WORM
Bias test artifacts   → Same, append-only
Governance logs       → Separate bucket, different encryption key, WORM

Access layer:

Dataset access        → EU-sovereign IAM (no AWS IAM, no Azure AD for this layer)
Access logs           → Written to WORM governance bucket automatically
NCA export API        → Pre-built data package generator for Art.74(12) requests

Documentation layer:

Art.10(2) records     → Generated automatically at each pipeline stage
Dataset provenance    → Cryptographic hash chain from origin to production
Bias test results     → Stored with dataset version reference
Art.11 technical docs → Updated to reference EU-sovereign storage locations

This architecture satisfies Art.10's traceability requirements while eliminating the CLOUD Act exposure that undermines chain-of-custody guarantees.
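
The "hash chain from origin to production" in the documentation layer can be read as a simple rule: the output hash of each pipeline stage must equal the input hash of the next. Here is a minimal sketch of that check. The record layout and function names are assumptions, and real stages would hash files rather than in-memory bytes.

```python
import hashlib


def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def stage_record(name: str, input_bytes: bytes, output_bytes: bytes) -> dict:
    """Describe one pipeline stage by the hashes of what went in and out."""
    return {
        "stage": name,
        "input_sha256": sha256_bytes(input_bytes),
        "output_sha256": sha256_bytes(output_bytes),
    }


def first_lineage_break(stages: list):
    """Return the index of the first stage whose input does not match the
    previous stage's output, or None if the lineage is continuous."""
    for index in range(1, len(stages)):
        if stages[index]["input_sha256"] != stages[index - 1]["output_sha256"]:
            return index
    return None
```

A non-None result points at the stage where an undocumented transformation or substitution happened, which is the place to start when reconstructing the trail.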


The Conformity Assessment Question

For high-risk AI systems that require notified body conformity assessment under Art.43, the notified body will review your complete technical documentation package under Art.11 — which includes your data governance records under Art.10.

Notified bodies in the EU are increasingly aware of CLOUD Act risk in technical documentation reviews. Several have begun requiring that providers attest to the complete access history of training datasets as part of the documentation review.

If your training data lived in AWS S3 during the training run, you need to be able to attest that no access occurred — or document and explain any access that did occur. A gap here is not automatically a conformity failure, but it is a documentation deficiency that the notified body must investigate.

The simplest attestation for the August 2026 deadline: migrate training data artifacts to EU-sovereign storage before the conformity assessment review, and document the migration as part of your Art.10 governance trail.


What to Do Before August 2, 2026

Practical steps for high-risk AI providers with training data on CLOUD Act-exposed infrastructure:

Immediate (this week):

  1. Audit where your training datasets, validation sets, and test sets currently live
  2. Identify which datasets contain personal data subject to GDPR and Art.10(3)–(5)
  3. Check whether your Art.10 governance logs are stored in the same infrastructure as the datasets
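
A first pass at this audit can be automated from an inventory of storage endpoints. The sketch below flags endpoints by hostname suffix. The suffix list is illustrative rather than exhaustive, the inventory format and priority labels are hypothetical, and the function expects HTTPS endpoint URLs rather than s3:// style URIs.

```python
from urllib.parse import urlparse

# Illustrative, non-exhaustive hostname suffixes for US-operated object storage.
US_OPERATED_SUFFIXES = (
    "amazonaws.com",
    "blob.core.windows.net",
    "storage.googleapis.com",
)


def audit_inventory(inventory: list) -> list:
    """Flag each inventory item by storage exposure and personal-data content."""
    findings = []
    for item in inventory:
        host = urlparse(item["endpoint"]).hostname or ""
        exposed = any(
            host == suffix or host.endswith("." + suffix)
            for suffix in US_OPERATED_SUFFIXES
        )
        if exposed and item.get("contains_personal_data"):
            priority = "high"
        elif exposed:
            priority = "medium"
        else:
            priority = "low"
        findings.append({
            "name": item["name"],
            "host": host,
            "cloud_act_exposed": exposed,
            "priority": priority,
        })
    return findings
```

Treat the output as a starting list for manual review. A hostname match says nothing about a provider's corporate structure, and an unmatched host is not proof of EU-sovereign operation.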

Within 30 days:

  1. Provision EU-sovereign object storage for training data artifacts
  2. Migrate current training dataset snapshots with hash verification
  3. Update Art.10 governance records to reflect new storage locations
  4. Implement access audit logging to WORM storage
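
Migrating "with hash verification" can be as simple as building a manifest of SHA-256 hashes on each side and comparing them. This sketch assumes both the source snapshot and the migrated copy are reachable as local directories (for example, mounted or downloaded). The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def _sha256_path(path: Path, chunk_size: int = 1024 * 1024) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root) -> dict:
    """Map each file's path (relative to root) to its SHA-256 hash."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): _sha256_path(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(source: dict, target: dict) -> dict:
    """List files that are missing, unexpected, or changed after migration."""
    return {
        "missing": sorted(set(source) - set(target)),
        "unexpected": sorted(set(target) - set(source)),
        "mismatched": sorted(
            name for name in source.keys() & target.keys()
            if source[name] != target[name]
        ),
    }
```

An empty result in all three lists is the evidence to attach to the migration entry in your Art.10 governance records, together with both manifests.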

Before August 2:

  1. Update Art.11 technical documentation to reflect EU-sovereign data storage
  2. Prepare Art.74(12) data package export capability for NCA requests
  3. Review DPAs with any remaining US-provider infrastructure for CLOUD Act provisions

What Art.10 Means for sota.io Customers

This is exactly where EU-native infrastructure provides a measurable compliance advantage.

When your training data and preprocessing pipelines run on sota.io's infrastructure:

Your Art.10 governance trail is complete, unbroken, and legally defensible — not contingent on what a US court may or may not compel from your cloud provider.


Summary

EU AI Act Art.10 establishes detailed training data governance requirements for high-risk AI providers. The compliance program most teams have built is correct — documentation, provenance, bias testing, quality criteria. What is often missing is the infrastructure layer that makes those records auditable and legally defensible.

Training datasets stored on AWS S3, Azure Blob, or GCP Cloud Storage are subject to CLOUD Act jurisdiction. A US government access request can compel disclosure without EU judicial oversight, creating a chain-of-custody gap in your Art.10 governance trail. For systems with special category personal data in training sets, the GDPR Art.44 implications compound the exposure.

EU-sovereign storage for training data artifacts is not just a data residency preference — it is an Art.10 compliance precondition for high-risk AI providers that need their governance documentation to survive an NCA audit or notified body conformity assessment review.

55 days to the deadline. Where does your training data live?


This post is part 2 of 5 in the sota.io EU AI Act Infrastructure Compliance Series. Part 1 covered Art.12 record-keeping and Art.19 automatic log generation. Part 3 will cover Art.9 Risk Management System documentation: which infrastructure decisions affect your RMS completeness.
