2026-06-09·5 min read·sota.io Team

EU AI Act Art.72 PMS: MLOps Implementation Guide — Drift Detection, Alert Thresholds & Retraining Triggers for High-Risk AI 2026

Post #2 in the sota.io EU AI Act Post-Market Monitoring Operations 2026 Series


The EU AI Act Art.72 requires high-risk AI system providers to establish Post-Market Monitoring (PMS) systems that continuously track real-world performance. But regulators do not define how to detect drift, what thresholds constitute a reportable degradation, or when retraining triggers a new conformity assessment. Those decisions are yours — and they will be audited.

This guide covers the MLOps layer of PMS compliance: concrete drift detection methods, threshold calibration frameworks, retraining trigger logic, and the exact boundary between an internal alert and an Art.73 serious incident notification.

The Three-Layer Drift Problem in High-Risk AI

High-risk AI systems face three distinct types of performance degradation, each with different detection approaches and regulatory implications under Art.72:

Data drift (covariate shift): The statistical distribution of input features shifts from the training distribution. A credit-scoring model trained on 2023 income data may receive 2026 salary structures it has never seen. The model continues producing outputs — but accuracy against ground truth degrades silently.

Concept drift: The relationship between inputs and the correct output, P(y|x), changes. A fraud detection model trained before a new payment fraud pattern emerges will classify novel fraud as legitimate. The input distribution can look unchanged; what changed is the mapping from inputs to outcomes. (This differs from label shift, where only the class frequencies P(y) move.)

Model degradation (performance drift): Measured accuracy, fairness, or calibration metrics fall below acceptable thresholds regardless of identified drift cause. This is the most direct Art.72 trigger: Art.72(2) requires the provider to collect and analyse data on the system's performance throughout its lifetime in order to evaluate continuous compliance with the Chapter III, Section 2 requirements.

EU AI Act Art.72(1) requires the PMS system to be established and documented "in a manner that is proportionate to the nature of the AI technologies and the risks of the high-risk AI system", meaning your drift detection infrastructure needs to match the risk profile of your system. For Annex III high-risk systems, expect to need robust, documented monitoring; a lightweight accuracy spot-check is unlikely to suffice.
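A small synthetic sketch can make the first two layers concrete. It assumes a one-feature toy model with a fixed decision threshold; `frozen_model` and `ground_truth` are illustrative names, not part of any monitoring library. Shifting the inputs is visible to a label-free two-sample test, whereas changing the labelling rule only shows up once ground truth arrives.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

def frozen_model(x: np.ndarray) -> np.ndarray:
    """Toy deployed model: predicts 1 when the feature exceeds 0.0."""
    return (x > 0.0).astype(int)

def ground_truth(x: np.ndarray, threshold: float) -> np.ndarray:
    """Toy labelling rule; `threshold` stands in for how the world behaves."""
    return (x > threshold).astype(int)

x_reference = rng.normal(0.0, 1.0, 5000)      # training-time inputs
x_data_drift = rng.normal(1.0, 1.0, 5000)     # inputs shifted, rule unchanged
x_concept_drift = rng.normal(0.0, 1.0, 5000)  # inputs unchanged, rule changed

# Label-free check: compare each input sample against the reference sample.
ks_data = ks_2samp(x_reference, x_data_drift).statistic
ks_concept = ks_2samp(x_reference, x_concept_drift).statistic

# Label-dependent check: error rate once ground truth is available.
err_data = float(np.mean(frozen_model(x_data_drift) != ground_truth(x_data_drift, 0.0)))
err_concept = float(np.mean(frozen_model(x_concept_drift) != ground_truth(x_concept_drift, 0.5)))

print(f"data drift:    KS={ks_data:.3f}  error={err_data:.3f}")
print(f"concept drift: KS={ks_concept:.3f}  error={err_concept:.3f}")
```

In this toy the model happens to stay correct under the input shift, which real models usually will not. The point is only that the two layers need different signals: input statistics for data drift, delayed labels for concept drift.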

Setting Alert Thresholds: The Regulatory Framework

The most critical design decision in Art.72 PMS implementation is threshold calibration: at what degradation level must you act? The AI Act provides a framework but not numbers.

Tier 1: Internal Investigation Threshold

This is your first-line alert — automated monitoring that triggers internal review without immediate regulatory notification.

Calibrate Tier 1 thresholds using your pre-deployment validation baseline:

# Example threshold calibration using validation set performance

BASELINE = {
    "accuracy": 0.924,     # from validation report in technical documentation
    "f1_macro": 0.891,
    "auc_roc": 0.967,
    "fairness_max_disparity": 0.08,  # across protected characteristics
}

TIER_1_THRESHOLDS = {
    # Absolute drop from baseline, in percentage points (pp)
    "accuracy": BASELINE["accuracy"] - 0.03,           # -3 pp → internal review
    "f1_macro": BASELINE["f1_macro"] - 0.05,           # -5 pp → internal review
    "auc_roc": BASELINE["auc_roc"] - 0.02,             # -2 pp → internal review
    "fairness_max_disparity": 0.12,                     # absolute → internal review
    # Data drift: Population Stability Index
    "psi_score": 0.10,                                  # PSI > 0.1 → investigate
    # Calibration drift
    "expected_calibration_error": 0.04,                # ECE > 4% → investigate
}

Documentation requirement: These thresholds and their justification belong in your post-market monitoring plan, which Art.72(3) makes part of the Annex IV technical documentation. "We set the accuracy drop at 3 percentage points because our deployment context involves [use case] where errors have [impact]" is the level of reasoning auditors are likely to expect.
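The `expected_calibration_error` threshold above presumes some way of computing ECE. The following is a minimal sketch of one common variant, assuming a binary classifier and equal-width bins over the predicted positive-class probability; multiclass systems usually bin on top-label confidence instead, and the function name and sample data here are illustrative.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """Binary ECE: sample-weighted mean of |observed positive rate - mean predicted probability| per bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Equal-width bins on [0, 1]; a probability of exactly 1.0 goes into the last bin.
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue
        gap = abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
        ece += in_bin.mean() * gap  # weight by the share of samples in this bin
    return float(ece)

# Illustrative data: the model outputs 0.9 for every case.
probs = np.full(100, 0.9)
well_calibrated = np.array([1] * 90 + [0] * 10)  # 90% really are positive
overconfident = np.array([1] * 50 + [0] * 50)    # only 50% are positive

ece_good = expected_calibration_error(well_calibrated, probs)
ece_bad = expected_calibration_error(overconfident, probs)
print(f"well calibrated: {ece_good:.3f}, overconfident: {ece_bad:.3f}")
```

Whichever variant you pick, record the bin count and binning scheme next to the threshold, since ECE values are not comparable across schemes.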

Tier 2: Serious Performance Degradation Threshold

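Tier 2 alerts indicate a risk to safety or fundamental rights that requires an Art.73 serious incident assessment. EU AI Act Art.3(49) defines a serious incident as an incident or malfunctioning that directly or indirectly leads to death or serious harm to a person's health, serious and irreversible disruption of critical infrastructure, infringement of Union-law obligations intended to protect fundamental rights, or serious harm to property or the environment. Art.73 then sets the reporting obligations.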

TIER_2_THRESHOLDS = {
    # Performance collapse: absolute drop from baseline, in percentage points (pp)
    "accuracy": BASELINE["accuracy"] - 0.10,           # -10 pp → Art.73 assessment
    "f1_macro": BASELINE["f1_macro"] - 0.15,           # -15 pp → Art.73 assessment
    # Severe fairness degradation
    "fairness_max_disparity": 0.20,                    # 20%+ gap → Art.73 assessment
    # PSI indicating complete distribution collapse
    "psi_score": 0.25,                                  # PSI > 0.25 → Art.73 assessment
}

The Tier 1 to Tier 2 gap is your investigation window: the period in which you must diagnose the degradation before it becomes a potentially reportable incident. If your monitoring fires a Tier 1 alert and you do not investigate and remediate before Tier 2 is hit, your own logs will document a period in which you knew about the degradation and did not act, which is hard to defend in an Art.73 review.
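One way to turn the two threshold tables into a single decision is sketched below. It assumes the example values from above (copied into `LIMITS` so the snippet runs on its own) and a hypothetical `classify_tier` helper; `auc_roc` and ECE are left out because the example defines no Tier 2 limit for them.

```python
BASELINE = {"accuracy": 0.924, "f1_macro": 0.891}

# metric -> (tier-1 limit, tier-2 limit, direction)
# "min": breach when the value falls below the limit; "max": breach when it rises above.
LIMITS = {
    "accuracy": (BASELINE["accuracy"] - 0.03, BASELINE["accuracy"] - 0.10, "min"),
    "f1_macro": (BASELINE["f1_macro"] - 0.05, BASELINE["f1_macro"] - 0.15, "min"),
    "fairness_max_disparity": (0.12, 0.20, "max"),
    "psi_score": (0.10, 0.25, "max"),
}

def classify_tier(metrics: dict) -> dict:
    """Returns the highest tier breached and the per-metric breaches."""
    breaches = {}
    for name, value in metrics.items():
        if name not in LIMITS:
            continue  # metrics without documented limits are ignored here
        tier1, tier2, direction = LIMITS[name]
        if direction == "min":
            tier = 2 if value < tier2 else 1 if value < tier1 else 0
        else:
            tier = 2 if value > tier2 else 1 if value > tier1 else 0
        if tier:
            breaches[name] = tier
    return {"tier": max(breaches.values(), default=0), "breaches": breaches}

print(classify_tier({"accuracy": 0.91, "psi_score": 0.05}))
print(classify_tier({"accuracy": 0.88, "psi_score": 0.12}))
print(classify_tier({"accuracy": 0.80, "fairness_max_disparity": 0.15}))
```

Reporting the worst tier together with the metrics that caused it keeps the alert and the later audit entry consistent with each other.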

Drift Detection Implementation

Data Drift Detection with Population Stability Index

PSI is a regulation-friendly data drift metric: it is interpretable, long established in credit-risk model validation practice under Basel II/III (making it familiar in Annex III financial sector deployments), and straightforward to explain to NCA auditors.

import numpy as np

def calculate_psi(reference: np.ndarray, current: np.ndarray, buckets: int = 10) -> float:
    """
    Population Stability Index.
    < 0.10: No significant change
    0.10-0.25: Moderate change → Tier 1 alert
    > 0.25: Major shift → Tier 2 assessment
    """
    breakpoints = np.quantile(reference, np.linspace(0, 1, buckets + 1))
    breakpoints[0] = -np.inf
    breakpoints[-1] = np.inf

    ref_counts = np.histogram(reference, bins=breakpoints)[0] / len(reference)
    cur_counts = np.histogram(current, bins=breakpoints)[0] / len(current)

    # Avoid log(0)
    ref_counts = np.where(ref_counts == 0, 0.0001, ref_counts)
    cur_counts = np.where(cur_counts == 0, 0.0001, cur_counts)

    psi = np.sum((cur_counts - ref_counts) * np.log(cur_counts / ref_counts))
    return float(psi)
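
In practice PSI is computed per feature and mapped onto the alert bands. The sketch below assumes features arrive as a dict of NumPy arrays and uses synthetic data; `psi` repeats the formula above so the snippet runs on its own, and `psi_report` is a hypothetical helper name.

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, buckets: int = 10) -> float:
    """Same PSI formula as above, repeated so this sketch is self-contained."""
    edges = np.quantile(reference, np.linspace(0, 1, buckets + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref = np.histogram(reference, bins=edges)[0] / len(reference)
    cur = np.histogram(current, bins=edges)[0] / len(current)
    ref = np.where(ref == 0, 0.0001, ref)  # avoid log(0)
    cur = np.where(cur == 0, 0.0001, cur)
    return float(np.sum((cur - ref) * np.log(cur / ref)))

def psi_report(reference: dict, current: dict) -> dict:
    """Per-feature PSI with the band names used in this guide."""
    report = {}
    for name, ref_values in reference.items():
        score = psi(ref_values, current[name])
        band = "tier2" if score > 0.25 else "tier1" if score > 0.10 else "stable"
        report[name] = {"psi": round(score, 4), "band": band}
    return report

# Synthetic example: "income" shifts by one standard deviation, "age" does not.
rng = np.random.default_rng(42)
reference = {"income": rng.normal(0, 1, 5000), "age": rng.normal(0, 1, 5000)}
current = {"income": rng.normal(1, 1, 5000), "age": rng.normal(0, 1, 5000)}

report = psi_report(reference, current)
for feature, row in report.items():
    print(feature, row)
```

Keeping the result per feature matters for the investigation step, because a single aggregate score hides which input moved.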

Concept Drift Detection with ADWIN

For concept drift where you have delayed labels (a common situation in Annex III systems where outcomes are known weeks after prediction), the Adaptive Windowing (ADWIN) algorithm detects statistical changes in error rates as ground truth arrives:

from river import drift

class ConceptDriftMonitor:
    def __init__(self, feature_name: str):
        self.detector = drift.ADWIN(delta=0.002)  # false positive rate ~0.2%
        self.feature_name = feature_name
        self.drift_events = []

    def update(self, error_value: float, timestamp: str) -> bool:
        """Returns True if concept drift detected."""
        self.detector.update(error_value)
        if self.detector.drift_detected:
            self.drift_events.append({
                "timestamp": timestamp,
                "feature": self.feature_name,
                "drift_type": "concept"
            })
            return True
        return False
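
With delayed labels, the error stream only exists once outcomes are joined back to earlier predictions. The sketch below shows that join with a hypothetical `DelayedLabelBuffer`. To stay dependency-free it swaps in a hand-written Page-Hinkley test for an increase in the mean error rate instead of ADWIN; the parameter values and the synthetic outcome pattern are assumptions for illustration only.

```python
from typing import Optional

class DelayedLabelBuffer:
    """Holds predictions until the matching ground-truth outcome arrives."""

    def __init__(self):
        self._pending = {}

    def record_prediction(self, case_id: str, prediction: int) -> None:
        self._pending[case_id] = prediction

    def record_outcome(self, case_id: str, outcome: int) -> Optional[float]:
        """Returns 1.0 for an error, 0.0 for a correct prediction, None for an unknown case."""
        prediction = self._pending.pop(case_id, None)
        if prediction is None:
            return None
        return float(prediction != outcome)

class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream."""

    def __init__(self, delta: float = 0.005, threshold: float = 5.0, min_samples: int = 30):
        self.delta, self.threshold, self.min_samples = delta, threshold, min_samples
        self.n, self.mean, self.cum, self.cum_min = 0, 0.0, 0.0, 0.0

    def update(self, x: float) -> bool:
        self.n += 1
        self.mean += (x - self.mean) / self.n    # running mean of the stream
        self.cum += x - self.mean - self.delta   # cumulative deviation above the mean
        self.cum_min = min(self.cum_min, self.cum)
        return self.n >= self.min_samples and (self.cum - self.cum_min) > self.threshold

buffer, detector = DelayedLabelBuffer(), PageHinkley()

# Predictions are made first; outcomes arrive later, in order.
for i in range(1000):
    buffer.record_prediction(f"case-{i}", 1)

first_alarm = None
for i in range(1000):
    # Synthetic outcomes: 10% errors for the first 500 cases, 60% afterwards.
    is_error = (i % 10 == 9) if i < 500 else (i % 10 < 6)
    error = buffer.record_outcome(f"case-{i}", 0 if is_error else 1)
    if error is None:
        continue
    if detector.update(error) and first_alarm is None:
        first_alarm = i

print("first alarm at outcome index:", first_alarm)
```

The same buffer can feed the `ConceptDriftMonitor` above: pass each non-None error value to its `update` method as outcomes arrive.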

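EU AI Act record-keeping note: Art.12 requires high-risk AI systems to support automatic recording of events (logs) over their lifetime, and Art.19 requires providers to keep the automatically generated logs under their control for a period appropriate to the intended purpose, and for at least six months unless other Union or national law provides otherwise. Treat your drift event log as part of this audit trail.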
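A retention guard is one way to make sure routine log clean-up cannot undercut the six-month minimum. The sketch below assumes events are dicts with an ISO-8601 `timestamp` field, as produced by `datetime.isoformat()`; `MIN_RETENTION` (six months rounded up to 183 days) and `purge_expired` are illustrative choices, not requirements taken from the Act.

```python
from datetime import datetime, timedelta, timezone

MIN_RETENTION = timedelta(days=183)  # assumption: "six months" rounded up

def purge_expired(events: list, now: datetime, retention: timedelta) -> list:
    """Returns the events that must still be kept; rejects a retention below the minimum."""
    if retention < MIN_RETENTION:
        raise ValueError("retention period is shorter than the documented minimum")
    cutoff = now - retention
    return [e for e in events if datetime.fromisoformat(e["timestamp"]) >= cutoff]

now = datetime(2026, 8, 1, tzinfo=timezone.utc)
events = [
    {"id": "old", "timestamp": datetime(2025, 1, 1, tzinfo=timezone.utc).isoformat()},
    {"id": "recent", "timestamp": datetime(2026, 6, 1, tzinfo=timezone.utc).isoformat()},
]

kept = purge_expired(events, now, retention=timedelta(days=365))
print([e["id"] for e in kept])
```

If sector law requires a longer period, raise `MIN_RETENTION` rather than relying on each caller to pass the right value.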

Fairness Monitoring in Production

Art.72 PMS requirements intersect with GDPR obligations in Annex III systems that make automated decisions affecting individuals: Art.22 GDPR on automated individual decision-making and the Art.35 GDPR data protection impact assessment (DPIA). Treat fairness monitoring as mandatory in practice for systems in employment (Annex III pt.4), credit (Annex III pt.5b), or law enforcement (Annex III pt.6) contexts.

import numpy as np

class FairnessMonitor:
    """Monitors demographic performance parity for Art.72 + GDPR Art.22 compliance."""

    def __init__(self, protected_attributes: list[str], max_disparity: float = 0.08):
        self.protected_attributes = protected_attributes
        self.max_disparity = max_disparity  # document this in technical docs

    def compute_disparities(
        self,
        y_true: np.ndarray,
        y_pred: np.ndarray,
        groups: dict[str, np.ndarray]
    ) -> dict:
        """
        Computes per-group accuracy and max disparity across protected groups.
        Returns dict with per-group metrics and max_disparity flag.
        """
        results = {}
        group_accuracies = []

        for group_name, mask in groups.items():
            if mask.sum() < 30:  # minimum sample size for reliable measurement
                continue
            acc = (y_true[mask] == y_pred[mask]).mean()
            results[group_name] = {"accuracy": float(acc), "n": int(mask.sum())}
            group_accuracies.append(acc)

        if len(group_accuracies) >= 2:
            max_disparity = max(group_accuracies) - min(group_accuracies)
            results["max_disparity"] = float(max_disparity)
            results["tier1_breach"] = max_disparity > self.max_disparity
            results["tier2_breach"] = max_disparity > (self.max_disparity * 2.5)

        return results
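
The `groups` argument above is a mapping from group name to boolean mask. One way to build it from a single protected-attribute column is sketched below; `build_group_masks` and `accuracy_gap` are hypothetical helpers, the data is synthetic, and the 30-sample floor mirrors the class above.

```python
import numpy as np

def build_group_masks(attribute: np.ndarray) -> dict:
    """One boolean mask per distinct value of a protected attribute."""
    return {str(value): attribute == value for value in np.unique(attribute)}

def accuracy_gap(y_true, y_pred, groups: dict, min_n: int = 30):
    """Max minus min per-group accuracy, skipping groups below min_n samples."""
    accuracies = [
        float((y_true[mask] == y_pred[mask]).mean())
        for mask in groups.values()
        if mask.sum() >= min_n
    ]
    return max(accuracies) - min(accuracies) if len(accuracies) >= 2 else None

# Synthetic example: group "a" at 90% accuracy, "b" at 80%, "c" too small to measure.
attribute = np.array(["a"] * 100 + ["b"] * 100 + ["c"] * 10)
y_true = np.ones(210, dtype=int)
y_pred = y_true.copy()
y_pred[:10] = 0       # 10 errors in group "a"
y_pred[100:120] = 0   # 20 errors in group "b"
y_pred[200:] = 0      # group "c" is all wrong but has only 10 samples

groups = build_group_masks(attribute)
gap = accuracy_gap(y_true, y_pred, groups)
print(f"groups={sorted(groups)}  gap={gap:.3f}")
```

Note what the sample-size floor hides here: group "c" performs worst and is silently excluded. Logging the skipped groups alongside the disparity figure keeps that visible to reviewers.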

Retraining Triggers: The Conformity Assessment Boundary

The most operationally dangerous aspect of Art.72 PMS is the retraining boundary: when does retraining require a new conformity assessment under Art.43?

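EU AI Act Art.3(23) defines a "substantial modification" as a change made after placing on the market or putting into service that was not foreseen or planned in the initial conformity assessment and that either affects compliance with the Chapter III, Section 2 requirements or modifies the intended purpose for which the system was assessed. Art.43(4) adds that, for systems that continue to learn, changes pre-determined by the provider at the initial conformity assessment and recorded in the technical documentation are not substantial modifications. Retraining that changes only the model weights without changing architecture, training objective, or deployment scope is generally not a substantial modification, but you must document this judgment.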

Safe Retraining Boundaries

class RetrainingController:
    """
    Controls when retraining is triggered and documents modification scope
    for Art.43 substantial modification assessment.
    """

    SAFE_RETRAINING_CONDITIONS = [
        "continuous_learning_within_training_distribution",
        "data_refresh_same_schema_same_objective",
        "hyperparameter_tuning_no_architecture_change",
        "fine_tuning_pretrained_base_unchanged",
    ]

    REQUIRES_MODIFICATION_ASSESSMENT = [
        "new_output_classes_added",
        "training_objective_changed",
        "input_feature_schema_changed",
        "deployment_scope_expanded",
        "risk_tier_reclassification",
        "new_protected_attribute_in_scope",
    ]

    def assess_retraining_type(self, change_descriptor: dict) -> dict:
        """
        Returns whether retraining requires new conformity assessment.
        """
        flags = []
        for condition in self.REQUIRES_MODIFICATION_ASSESSMENT:
            if change_descriptor.get(condition, False):
                flags.append(condition)

        return {
            "requires_conformity_assessment": len(flags) > 0,
            "triggering_conditions": flags,
            "documentation_required": True,  # always document
            "nb_notification_required": len(flags) > 0 and change_descriptor.get("nb_certified", False),
        }
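
The `change_descriptor` passed to `assess_retraining_type` does not have to be filled in by hand. The sketch below derives four of the flags by comparing two model manifests; the manifest fields (`feature_schema`, `output_classes`, `objective`, `deployment_scope`) are a hypothetical layout, and the remaining flags still need a human judgment.

```python
import hashlib
import json

def schema_fingerprint(feature_schema: dict) -> str:
    """Order-independent hash of feature names and types."""
    canonical = json.dumps(feature_schema, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def build_change_descriptor(approved: dict, candidate: dict) -> dict:
    """Flags that can be derived mechanically from two manifests."""
    return {
        "input_feature_schema_changed": schema_fingerprint(approved["feature_schema"])
        != schema_fingerprint(candidate["feature_schema"]),
        "new_output_classes_added": not set(candidate["output_classes"])
        <= set(approved["output_classes"]),
        "training_objective_changed": approved["objective"] != candidate["objective"],
        "deployment_scope_expanded": not set(candidate["deployment_scope"])
        <= set(approved["deployment_scope"]),
    }

approved = {
    "feature_schema": {"income": "float", "age": "int"},
    "output_classes": ["approve", "reject"],
    "objective": "binary_crossentropy",
    "deployment_scope": ["DE", "FR"],
}
# Data refresh: same schema (key order differs), same objective, same scope.
refresh = dict(approved, feature_schema={"age": "int", "income": "float"})
# Expansion: a new input feature and a new market.
expansion = dict(
    approved,
    feature_schema={"income": "float", "age": "int", "postcode": "str"},
    deployment_scope=["DE", "FR", "IT"],
)

print(build_change_descriptor(approved, refresh))
print(build_change_descriptor(approved, expansion))
```

Storing the approved manifest with the conformity assessment file gives each later retraining run a fixed reference to compare against.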

Automated Retraining Pipeline with Compliance Gates

from datetime import datetime, timezone
from typing import Optional

class ComplianceAwareRetrainingPipeline:
    """
    Retraining pipeline with EU AI Act Art.72 compliance gates.
    All decisions are logged for Art.12 audit trail requirements.
    """

    def __init__(self, model_id: str, notified_body_id: Optional[str] = None):
        self.model_id = model_id
        self.notified_body_id = notified_body_id
        self.audit_log = []

    def _log_compliance_event(self, event_type: str, details: dict):
        self.audit_log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model_id": self.model_id,
            "event_type": event_type,
            "details": details
        })

    def trigger_retraining(
        self,
        drift_report: dict,
        change_descriptor: dict
    ) -> dict:
        """
        Evaluates whether retraining should proceed and documents the decision.
        """
        controller = RetrainingController()
        assessment = controller.assess_retraining_type(change_descriptor)

        self._log_compliance_event("retraining_trigger", {
            "drift_report": drift_report,
            "change_descriptor": change_descriptor,
            "conformity_assessment_required": assessment["requires_conformity_assessment"],
        })

        if assessment["requires_conformity_assessment"]:
            self._log_compliance_event("conformity_assessment_initiated", {
                "triggers": assessment["triggering_conditions"],
                "nb_id": self.notified_body_id,
            })
            return {
                "proceed": False,
                "reason": "substantial_modification_requires_assessment",
                "action_required": "complete_new_conformity_assessment",
            }

        # Safe to retrain — document the boundary
        self._log_compliance_event("retraining_approved", {
            "basis": "no_substantial_modification_detected",
            "change_scope": change_descriptor,
        })
        return {
            "proceed": True,
            "modification_type": "non_substantial",
            "documentation": "retraining_within_approved_boundary",
        }

The PMS Dashboard: What NCA Auditors Will Expect to See

When your NCA conducts a market surveillance inspection under Art.74, your PMS dashboard is a primary audit artifact. The Act does not prescribe a dashboard layout, but auditors can reasonably be expected to ask for:

Time-series performance graphs with baseline overlays and threshold markers clearly labeled. "Accuracy degraded 3.2% over 6 months" is not sufficient — auditors want to see the trajectory and the alert trigger point.

Drift event log with timestamps, affected features, PSI/ADWIN values, and the investigative response taken. Each event should have a closure status: resolved-no-action, resolved-retraining, or escalated-to-art73.

Fairness metric dashboard broken down by protected characteristic and time period. Trend charts showing whether demographic performance gaps are stable, widening, or narrowing.

Retraining decision log with the substantial modification assessment for each retraining event. If you retrained 12 times without a conformity assessment, each instance needs a documented justification.

Incident escalation record showing the linkage between PMS events and Art.73 serious incident notifications. If you have PMS alerts with no corresponding Art.73 investigation, auditors will ask why.
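
The closure statuses and the escalation linkage described above can be checked mechanically before an inspection. This is a minimal sketch assuming a hypothetical `PmsEvent` record; the `OPEN` status and the `art73_reference` and `rationale` fields are additions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Closure(Enum):
    OPEN = "open"  # added for the example: event not yet closed
    RESOLVED_NO_ACTION = "resolved-no-action"
    RESOLVED_RETRAINING = "resolved-retraining"
    ESCALATED_TO_ART73 = "escalated-to-art73"

@dataclass
class PmsEvent:
    event_id: str
    tier: int
    closure: Closure = Closure.OPEN
    art73_reference: Optional[str] = None  # ID of the linked serious incident report
    rationale: str = ""

def audit_gaps(events: list) -> dict:
    """Lists the event IDs an auditor would be likely to question."""
    return {
        "still_open": [e.event_id for e in events if e.closure is Closure.OPEN],
        "escalated_without_reference": [
            e.event_id
            for e in events
            if e.closure is Closure.ESCALATED_TO_ART73 and not e.art73_reference
        ],
        "closed_without_rationale": [
            e.event_id
            for e in events
            if e.closure is not Closure.OPEN and not e.rationale
        ],
    }

events = [
    PmsEvent("PMS-001", tier=1, closure=Closure.RESOLVED_NO_ACTION, rationale="seasonal shift"),
    PmsEvent("PMS-002", tier=2, closure=Closure.ESCALATED_TO_ART73),
    PmsEvent("PMS-003", tier=1),
]
gaps = audit_gaps(events)
print(gaps)
```

Running a check like this on a schedule turns the dashboard from a display into evidence that the loop is actually being closed.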

PMS Tool Stack for EU Hosting

EU AI Act record-keeping requirements (Art.12, Art.19) create a CLOUD Act jurisdiction problem for monitoring tools run by US providers: if your PMS logs sit with a US-controlled cloud (for example in an AWS, Azure, or GCP US region), US authorities can compel the provider to disclose your AI system's performance data, potentially without your knowledge. For Annex III systems in financial services, healthcare, or law enforcement, that exposure can conflict with sector confidentiality duties on top of your AI Act obligations.

EU-sovereign PMS infrastructure options:

Component	US-Hosted (Problematic)	EU-Sovereign Alternative
Metrics storage	Datadog, New Relic	Grafana on Hetzner/OVHcloud
Drift detection	SageMaker Model Monitor	Evidently AI (self-hosted)
Log storage	CloudWatch, Splunk Cloud	OpenSearch on IONOS
Feature store	SageMaker Feature Store	Feast on Scaleway
ML pipeline	SageMaker Pipelines	Kubeflow on EU K8s cluster

Deploying your PMS stack on EU-sovereign infrastructure does not automatically make it Art.72 compliant; the monitoring logic and thresholds are the compliance layer. But hosting on EU infrastructure closes the jurisdiction gap described above, which the AI Act's record-keeping and data governance duties make harder to ignore.

The Alert → Investigation → Resolution Protocol

Art.72 PMS is not just detection; it is a closed-loop system. Art.72(3) requires the monitoring system to be based on a post-market monitoring plan, and a plan without a documented response protocol is hard to defend.

The compliant response workflow:

T+0: Automated monitoring detects Tier 1 threshold breach. Alert fires to on-call ML engineer and compliance officer. PMS event logged with ID.

T+24h: Root cause analysis initiated. Is this data drift, concept drift, or model degradation? Is the degradation correlated with a specific deployment segment (geography, user cohort, time window)?

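T+72h: Internal severity assessment completed. Does the degradation meet the Art.3(49) criteria for a "serious incident"? If no: remediation plan documented. If yes: serious incident notification process initiated. Under Art.73(2) the report is due immediately after you establish a causal link (or its reasonable likelihood) and no later than 15 days after you become aware of the incident; Art.73(3) shortens this to 2 days for a widespread infringement or a critical-infrastructure incident, and Art.73(4) to 10 days in the event of a death.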

T+14d: Remediation completed (retraining, data correction, or deployment rollback) or Art.73 notification sent. PMS event closed with root cause and corrective action documented.

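This 14-day resolution window is not mandated by the AI Act. It is an internal target that stays inside the 15-day outer reporting limit of Art.73(2), and its 24-hour and 72-hour checkpoints mirror the staged reporting cadence of NIS2 Art.23, which gives you a defensible response posture for audit purposes.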
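The internal checkpoints in this workflow are easy to track in code. The sketch below assumes the T+24h, T+72h, and T+14d targets from the protocol above; the checkpoint names and the `overdue` helper are illustrative, and the statutory Art.73 reporting clock is deliberately left out because it depends on the type of incident.

```python
from datetime import datetime, timedelta, timezone

CHECKPOINTS = {
    "root_cause_analysis_started": timedelta(hours=24),
    "severity_assessment_completed": timedelta(hours=72),
    "event_closed": timedelta(days=14),
}

def checkpoint_deadlines(detected_at: datetime) -> dict:
    """Absolute deadline for each internal checkpoint, counted from detection (T+0)."""
    return {name: detected_at + offset for name, offset in CHECKPOINTS.items()}

def overdue(detected_at: datetime, completed: set, now: datetime) -> list:
    """Checkpoints whose deadline has passed without being marked complete."""
    return [
        name
        for name, deadline in checkpoint_deadlines(detected_at).items()
        if name not in completed and now > deadline
    ]

detected = datetime(2026, 3, 1, 9, 0, tzinfo=timezone.utc)
deadlines = checkpoint_deadlines(detected)
late = overdue(detected, {"root_cause_analysis_started"}, now=detected + timedelta(hours=80))

for name, deadline in deadlines.items():
    print(f"{name}: {deadline.isoformat()}")
print("overdue:", late)
```

Using timezone-aware UTC timestamps throughout avoids ambiguity when the audit trail is compared against notification records later.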

August 2026 Implementation Checklist

With the EU AI Act August 2, 2026 deadline for high-risk AI systems approaching, here is the MLOps PMS implementation priority order:

Week 1-2: Define baseline metrics from pre-deployment validation report. Calculate Tier 1 and Tier 2 thresholds with justification document.
Week 3-4: Implement data drift monitoring (PSI for tabular, embedding distance for unstructured). Wire Tier 1 alerts to incident management system.
Week 5-6: Implement concept drift monitoring (ADWIN or Page-Hinkley) for features with delayed ground truth. Calibrate false positive rate against your ground truth arrival latency.
Week 7-8: Add fairness monitoring for each protected characteristic relevant to your Annex III use case. Ensure demographic breakdown is in your technical documentation.
Week 9-10: Build retraining pipeline with substantial modification assessment gate. Document the boundary between safe retraining and new conformity assessment triggers.
Week 11-12: Build PMS dashboard with time-series baselines, drift event log, fairness trends, and retraining decision log. Conduct internal audit review before August 2 deployment.
Week 13+: Validate PMS against your internal NCA inspection checklist. Run tabletop exercise simulating a Tier 2 event escalation to Art.73.

What This Means for Your Hosting Decision

Art.72 PMS compliance requires log generation and retention (Art.12, Art.19), data flows whose jurisdiction you can account for, and an accessible audit trail for NCA inspection (Art.74). Each of these requirements is harder to satisfy when your ML infrastructure spans US-owned cloud services.

EU-native PaaS providers that offer Kubernetes, managed databases, and container registries within EU jurisdiction — without US-parent CLOUD Act exposure — eliminate the infrastructure compliance gap before it becomes an audit finding. Building your PMS stack on EU sovereign infrastructure means one fewer cross-compliance problem at the August deadline.

This post is part of the sota.io EU AI Act Post-Market Monitoring Operations 2026 Series. Post #1 covered Art.72 PMS Plan requirements and KPIs. Post #3 will cover bias monitoring in production and demographic performance tracking.
