2026-06-06·5 min read·sota.io Team

Art.17 Right to Erasure: LLM Training Data Removal & RAG Vector Store Deletion 2026

Post #3 in the sota.io EU AI Act + GDPR Intersection Series


A data subject submits a deletion request under GDPR Art.17. Your team confirms the deletion from your PostgreSQL database, your S3 bucket, your CDN cache. Then someone asks: what about the LLM you trained on a dataset that included this person's data? What about the vector store your RAG pipeline uses that contains their documents?

This is the machine unlearning problem — and it is one of the hardest open questions at the intersection of EU data protection law and modern AI systems. In 2026, with the EU AI Act entering full enforcement for high-risk systems, the question has moved from academic research into compliance obligations.

This guide covers what Art.17 actually requires, what the EU AI Act adds for training data governance, and what developers can do today.

GDPR Art.17 — "Right to erasure ('right to be forgotten')" — gives data subjects the right to obtain erasure of personal data without undue delay when:

The data is no longer necessary for the purpose it was collected (Art.17(1)(a))
The data subject withdraws consent and there is no other legal basis (Art.17(1)(b))
The data subject objects under Art.21 and there is no overriding legitimate interest (Art.17(1)(c))
The data was unlawfully processed (Art.17(1)(d))
Erasure is required by EU or Member State law (Art.17(1)(e))
The data was collected in relation to the offer of information society services to a child under Art.8(1) (Art.17(1)(f))

Art.17(3) exceptions allow continued processing despite an erasure request when it is necessary for:

Exercising the right of freedom of expression and information
Compliance with a legal obligation
Reasons of public interest in the area of public health (Art.9(2)(h) and (i))
Archiving in the public interest, scientific or historical research, or statistical purposes under Art.89(1), where erasure would make those objectives impossible or seriously impair them
Establishment, exercise or defence of legal claims

The critical phrase is "without undue delay." The controller does not get an indefinite grace period: if erasure is technically difficult, that difficulty must be justified or mitigated by design.

The Machine Unlearning Problem

Neural networks learn by adjusting millions (or billions) of parameters based on training examples. Once a model is trained, there is no simple pointer to "data point X" inside the weights. The information from any single training example is distributed non-linearly across the entire parameter space.

This creates a structural problem for Art.17 compliance:

For base LLM fine-tuning: If you fine-tuned a model on a dataset that included personal data — customer service transcripts, employee emails, product reviews — and a data subject requests erasure, you cannot surgically remove their data from the weights without retraining.

For GPAI model providers: The scale is even more extreme. A GPAI model trained on web-scale data may contain personal data from billions of sources. The EU AI Act does not provide a technical solution: Art.10 (data governance) applies to high-risk AI systems, while GPAI providers face documentation and training-content summary obligations under Art.53.

For embedding models used in RAG: Vector embeddings of documents containing personal data are more tractable — individual embeddings can be deleted and re-indexed. But the embedding model's weights themselves may have been trained on personal data.

EU AI Act Art.10: Data and Data Governance

The EU AI Act Art.10 imposes data governance obligations for high-risk AI systems. For training data, Art.10(2) requires that:

Relevant design choices are documented
Training data is examined for possible biases
Appropriate measures are taken to detect, prevent and mitigate biases
Personal data is processed in accordance with applicable data protection law

Art.10(3) specifically notes that training, validation and testing datasets shall be "relevant, sufficiently representative, and to the best extent possible, free of errors and complete in view of the intended purpose."

Art.10(5) adds that, to the extent strictly necessary for bias detection and correction, a provider may exceptionally process special categories of personal data (Art.9 GDPR), but only subject to appropriate safeguards for the fundamental rights and freedoms of natural persons.

The key compliance implication: you must document what personal data entered your training pipeline, under what legal basis, and whether any erasure requests have been handled. This documentation feeds into the EU AI Act Art.11 technical documentation and the EU AI Act Art.17 quality management system obligations (not to be confused with GDPR Art.17).

Practical Compliance Architecture

Tier 1: Training Data Governance (Proactive)

The most reliable approach to Art.17 compliance is preventing the problem before it occurs.

Data Subject Inventory for Training Data

Before training, document every source in your training dataset:

Source (URL, database table, third-party dataset)
Whether the source contains personal data
Legal basis for processing (consent, legitimate interest, contract)
Data subject categories affected
Retention and deletion procedures

For high-risk AI systems, this inventory supports the data governance practices required by EU AI Act Art.10(2) and feeds your technical documentation under Art.11. It also makes GDPR Art.17 requests tractable: you can determine whether a specific data subject's data was included in training.
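The inventory fields listed above can be captured in a small data structure. The following is a minimal sketch, not a prescribed schema: the class name `TrainingSource`, its field names, and the two helper functions are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TrainingSource:
    """One entry in the training data inventory (field names are illustrative)."""
    name: str                            # URL, database table, or third-party dataset
    contains_personal_data: bool
    legal_basis: Optional[str] = None    # e.g. "consent", "legitimate_interest", "contract"
    subject_categories: Tuple[str, ...] = ()
    retention_days: Optional[int] = None


def validate_inventory(sources: List[TrainingSource]) -> List[str]:
    """Return a list of gaps; an empty list means every personal-data source is documented."""
    problems = []
    for s in sources:
        if s.contains_personal_data and not s.legal_basis:
            problems.append(f"{s.name}: personal data without a documented legal basis")
        if s.contains_personal_data and not s.subject_categories:
            problems.append(f"{s.name}: data subject categories not documented")
    return problems


def sources_for_category(sources: List[TrainingSource], category: str) -> List[str]:
    """List the sources that may hold data about a given category of data subject."""
    return [
        s.name
        for s in sources
        if s.contains_personal_data and category in s.subject_categories
    ]
```

Running `validate_inventory` before each training run is one way to catch an undocumented source before it reaches the pipeline, and `sources_for_category` narrows down where to look when an erasure request arrives.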

Train-Time Tagging

If you maintain your training data in a structured format, tag each document or record with a unique data subject identifier. This enables:

Rapid identification of training data subject to erasure requests
Automated flagging when an Art.17 request arrives
Documentation for Art.10 compliance records

This approach works well for fine-tuning datasets where you control the source data. It does not work retroactively for models already trained on unstructured web data.
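One way to tag records without storing the raw identity next to the training text is a keyed hash. The sketch below assumes that each record can be linked to a stable identifier such as an email address, and that `PSEUDONYM_KEY` (a placeholder here) is held in a secrets manager separate from the dataset. The function names are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, List

# Assumption: in a real system this key comes from a secrets manager.
PSEUDONYM_KEY = b"replace-with-a-secret-key"


def subject_tag(identifier: str, key: bytes = PSEUDONYM_KEY) -> str:
    """Derive a stable pseudonymous tag from a normalised identifier."""
    normalised = identifier.strip().lower().encode("utf-8")
    return hmac.new(key, normalised, hashlib.sha256).hexdigest()


def tag_record(text: str, identifier: str) -> Dict[str, str]:
    """Attach the data subject tag to a training record at ingestion time."""
    return {"text": text, "data_subject_id": subject_tag(identifier)}


def records_to_erase(dataset: List[Dict[str, str]], identifier: str) -> List[int]:
    """Return the indices of all records that belong to the requesting data subject."""
    tag = subject_tag(identifier)
    return [i for i, record in enumerate(dataset) if record["data_subject_id"] == tag]
```

The tag is still personal data in the GDPR sense (it is pseudonymised, not anonymised), but it lets an erasure workflow find the affected records from the identifier supplied in the request.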

Data Minimisation at Ingestion

GDPR Art.5(1)(c) requires personal data to be "adequate, relevant and limited to what is necessary." For AI training:

Apply pseudonymisation or anonymisation before ingestion where possible
Remove unnecessary personal identifiers (names, email addresses, phone numbers) from training data that does not require them
Document the anonymisation method — inadequate anonymisation does not exempt you from Art.17 obligations

Technical note on anonymisation: Differential privacy during training provides a formal privacy guarantee and reduces (but does not eliminate) memorisation risk. EU regulators have not yet defined minimum epsilon values for GDPR compliance — document your choices.
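Removing direct identifiers at ingestion can be sketched with pattern-based redaction. This is a deliberately simple illustration: the two regular expressions are assumptions that will miss many real-world formats, and redacting emails and phone numbers alone does not amount to anonymisation.

```python
import re
from typing import Dict, Tuple

# Illustrative patterns only; production pipelines usually combine rules with NER.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> Tuple[str, Dict[str, int]]:
    """Replace emails and phone numbers with placeholders and report how many were found."""
    text, n_emails = EMAIL_RE.subn("[EMAIL]", text)
    text, n_phones = PHONE_RE.subn("[PHONE]", text)
    return text, {"emails": n_emails, "phones": n_phones}


redacted, counts = redact("Contact jane.doe@example.com or +49 30 1234567 today.")
print(redacted)
print(counts)
```

The returned counts can be logged per source as part of documenting the minimisation method.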

Tier 2: Machine Unlearning Approaches

When a data subject requests erasure after training, several technical approaches exist. None is perfect; all require compliance documentation.

Full Retraining (Gold Standard, Expensive)

The most compliant approach: when an erasure request arrives, remove the relevant data from your training set and retrain the model from scratch. This provides a strong guarantee that the data is no longer in the model.

Practical for: fine-tuned models with manageable training costs (smaller models, shorter training runs).

Compliance documentation: retain your training run logs, the pre-erasure dataset hash, and the post-retrain dataset hash. This creates an auditable chain of evidence.

Not practical for: foundation GPAI models with training runs costing millions of euros.
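The pre-erasure and post-erasure dataset hashes mentioned above could be produced along these lines. This is a sketch under assumptions: records are JSON-serialisable dictionaries carrying a `data_subject_id` field, and the fingerprint scheme (a hash over sorted per-record hashes) is one reasonable choice, not a standard.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def dataset_fingerprint(records: List[dict]) -> str:
    """Order-independent SHA-256 fingerprint over the canonical JSON of each record."""
    record_hashes = sorted(
        hashlib.sha256(
            json.dumps(r, sort_keys=True, ensure_ascii=False).encode("utf-8")
        ).hexdigest()
        for r in records
    )
    return hashlib.sha256("\n".join(record_hashes).encode("ascii")).hexdigest()


def erase_and_fingerprint(records: List[dict], data_subject_id: str) -> Tuple[List[dict], Dict]:
    """Drop one data subject's records and return evidence for the audit trail."""
    kept = [r for r in records if r.get("data_subject_id") != data_subject_id]
    evidence = {
        "pre_erasure_hash": dataset_fingerprint(records),
        "post_erasure_hash": dataset_fingerprint(kept),
        "records_removed": len(records) - len(kept),
    }
    return kept, evidence
```

Storing the evidence dictionary alongside the retraining run ID links the erasure request to the exact dataset version the new model was trained on.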

SISA Training (Sharded, Isolated, Sliced, and Aggregated)

SISA training splits the dataset into shards and trains a constituent model on each shard. When an erasure request arrives, only the constituent model(s) whose shard contained the deleted data need to be retrained.

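Benefits: for a single erasure request, retraining cost falls to roughly 1/k of a full retrain, where k is the number of shards. Because the affected constituent model is retrained without the erased data, the guarantee is equivalent to full retraining.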

Requirements: you must implement SISA from the start of training, because it cannot be applied retroactively to an already-trained model. Document the shard allocation and constituent model registry in your Art.10 data governance records and Art.11 technical documentation.
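The shard bookkeeping behind SISA can be illustrated with a toy example. The sketch below makes several simplifying assumptions: each "constituent model" is just the mean of the values in its shard, aggregation is the mean of those means, slicing within shards is omitted, and records are sharded by data subject so that one erasure touches one shard. The class `ShardedEnsemble` is hypothetical.

```python
import hashlib
from statistics import mean


class ShardedEnsemble:
    """Toy SISA-style ensemble: one trivial model per shard, retrained independently."""

    def __init__(self, num_shards: int):
        self.num_shards = num_shards
        self.shards = [[] for _ in range(num_shards)]  # (subject_id, value) pairs
        self.models = [None] * num_shards              # per-shard toy model (a mean)
        self.retrain_count = 0

    def shard_of(self, subject_id: str) -> int:
        # Deterministic allocation, so the shard can be found again at erasure time.
        digest = hashlib.sha256(subject_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_shards

    def add(self, subject_id: str, value: float) -> None:
        self.shards[self.shard_of(subject_id)].append((subject_id, value))

    def _train_shard(self, k: int) -> None:
        values = [v for _, v in self.shards[k]]
        self.models[k] = mean(values) if values else None
        self.retrain_count += 1

    def fit(self) -> None:
        for k in range(self.num_shards):
            self._train_shard(k)

    def predict(self) -> float:
        trained = [m for m in self.models if m is not None]
        if not trained:
            raise RuntimeError("no constituent model has been trained")
        return mean(trained)  # aggregation step

    def erase(self, subject_id: str) -> int:
        """Remove a data subject and retrain only the shard that held their data."""
        k = self.shard_of(subject_id)
        before = len(self.shards[k])
        self.shards[k] = [(s, v) for s, v in self.shards[k] if s != subject_id]
        if len(self.shards[k]) != before:
            self._train_shard(k)
        return k
```

With four shards, the initial fit trains four constituent models and a later `erase` call retrains exactly one. In a real system the per-shard model would be a neural network and the shard allocation table would be part of the documented registry.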

Approximate Unlearning (Lower Cost, Weaker Guarantee)

Several approximate unlearning algorithms have been proposed (gradient ascent on forgotten data, Fisher forgetting, influence function approximation). These modify the model weights to reduce (but not eliminate) the influence of specific training examples.

From a GDPR compliance perspective: approximate unlearning is not erasure. If the data remains statistically recoverable from the weights, the personal data has not been erased. You would need a formal privacy audit (e.g., membership inference attack testing) to demonstrate that the data is no longer extractable.

Use approximate unlearning as a risk mitigation, not a compliance solution. Document it clearly as such.
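A basic form of the membership inference testing mentioned above compares the model's per-example loss on the "forgotten" records with its loss on records it never saw. The sketch below assumes you can compute those losses yourself; `membership_auc` is a hypothetical helper, and a score near 0.5 is supporting evidence only, not proof of erasure.

```python
from typing import Sequence


def membership_auc(forgotten_losses: Sequence[float], holdout_losses: Sequence[float]) -> float:
    """AUC of the rule "lower loss means the example was in training".

    0.5 means forgotten and never-seen examples are indistinguishable by loss;
    values close to 1.0 mean the model still treats the forgotten data as familiar.
    """
    if not forgotten_losses or not holdout_losses:
        raise ValueError("both loss lists must be non-empty")
    wins = 0.0
    for f in forgotten_losses:
        for h in holdout_losses:
            if f < h:
                wins += 1.0
            elif f == h:
                wins += 0.5
    return wins / (len(forgotten_losses) * len(holdout_losses))
```

Recording this score before and after the unlearning step gives the residual risk assessment a measurable basis. Stronger attacks exist, so a loss-based test is best treated as a lower bound on what an adversary could infer.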

Documentation Template for Erasure Requests

Data Subject Erasure Request — AI Training Data
================================================
Request ID: [UID]
Request Date: [DATE]
Data Subject: [ANONYMIZED_ID]
Affected Training Dataset: [DATASET_NAME, VERSION]
Data Subject Data Present: [YES/NO]
Legal Basis for Original Processing: [BASIS]
Erasure Action Taken: [FULL_RETRAIN | SISA_RETRAIN | APPROXIMATE_UNLEARNING | EXCLUDED_FROM_NEXT_TRAINING | N/A]
Completion Date: [DATE]
Residual Risk Assessment: [NONE | LOW | MEDIUM + JUSTIFICATION]
Art.17(3) Exception Applies: [YES/NO + BASIS]
Reviewed By: [DPO_NAME]
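The template above can also be kept as a structured record so that incomplete entries are caught automatically. This is a sketch only: the class and field names mirror the template but are otherwise assumptions, and the consistency checks are examples rather than a complete rule set.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional


class ErasureAction(str, Enum):
    FULL_RETRAIN = "FULL_RETRAIN"
    SISA_RETRAIN = "SISA_RETRAIN"
    APPROXIMATE_UNLEARNING = "APPROXIMATE_UNLEARNING"
    EXCLUDED_FROM_NEXT_TRAINING = "EXCLUDED_FROM_NEXT_TRAINING"
    NOT_APPLICABLE = "N/A"


@dataclass
class ErasureRecord:
    request_id: str
    request_date: date
    data_subject: str                       # anonymised ID, never the raw identity
    dataset: str                            # dataset name and version
    data_present: bool
    legal_basis: str
    action: ErasureAction
    residual_risk: str                      # "NONE", "LOW" or "MEDIUM"
    reviewed_by: str
    completion_date: Optional[date] = None
    risk_justification: str = ""
    exception_basis: Optional[str] = None   # Art.17(3) ground, if one applies

    def problems(self) -> List[str]:
        """Return consistency issues; an empty list means the record is complete."""
        issues = []
        if self.residual_risk not in {"NONE", "LOW", "MEDIUM"}:
            issues.append("residual_risk must be NONE, LOW or MEDIUM")
        if self.residual_risk == "MEDIUM" and not self.risk_justification:
            issues.append("MEDIUM residual risk needs a justification")
        if (self.data_present and self.action is ErasureAction.NOT_APPLICABLE
                and not self.exception_basis):
            issues.append("data present but no action and no Art.17(3) exception")
        if self.completion_date and self.completion_date < self.request_date:
            issues.append("completion_date precedes request_date")
        return issues
```

A record that returns a non-empty list from `problems()` can be blocked from sign-off until the DPO resolves the gaps.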

Tier 3: RAG Vector Store Deletion

For RAG pipelines, the compliance picture is more tractable than for base model weights. See our earlier guide on RAG compliance and GDPR vector stores for the full architecture. Here we focus specifically on the Art.17 deletion workflow.

Vector Embedding Deletion

When a data subject requests erasure:

Query your document metadata store for all chunks derived from the data subject's documents
Delete those vector embeddings by ID from your vector store (all major stores — Qdrant, Weaviate, pgvector, Pinecone — support point-level deletion)
Delete the source documents from your document store
Verify deletion: re-run the similarity search queries that previously retrieved the deleted documents and confirm that none of the deleted point IDs appear in the results

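# Example: Qdrant deletion workflow
from qdrant_client import QdrantClient, models

client = QdrantClient(host="localhost", port=6333)

def erase_data_subject(collection_name: str, data_subject_id: str) -> dict:
    # Match every point whose payload carries this data subject's ID
    subject_filter = models.Filter(
        must=[
            models.FieldCondition(
                key="data_subject_id",
                match=models.MatchValue(value=data_subject_id),
            )
        ]
    )

    # Count first: unlike a single scroll page, this is not capped at 1000 points
    matched = client.count(
        collection_name=collection_name,
        count_filter=subject_filter,
        exact=True,
    ).count

    if matched == 0:
        return {"deleted": 0, "status": "not_found"}

    # Delete by filter so that all matching points are removed in one operation
    client.delete(
        collection_name=collection_name,
        points_selector=models.FilterSelector(filter=subject_filter),
        wait=True,
    )

    # Verify that no point for this data subject remains
    remaining = client.count(
        collection_name=collection_name,
        count_filter=subject_filter,
        exact=True,
    ).count
    status = "erased" if remaining == 0 else "incomplete"
    return {"deleted": matched - remaining, "status": status}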

Important: this deletes the stored vectors but does not affect the embedding model weights (which may themselves have been trained on this data). If your embedding model was fine-tuned on personal data, the machine unlearning problem applies to the embedding model as well.

Re-indexing After Deletion

Deleting individual vectors does not require full re-indexing in most vector databases. However, if your RAG pipeline caches document summaries or pre-computed responses that include the deleted data, those caches must also be invalidated.

Build a deletion propagation map:

Vector embeddings
Document store (original text)
Summary cache
Response cache
Search index (if you use a BM25 hybrid retriever)
Conversation history (if your system stores sessions)

Document this map in your Data Protection Impact Assessment under GDPR Art.35.
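The propagation map can be made executable by registering one deletion handler per store and recording the outcome for each. The sketch below uses in-memory dictionaries as stand-ins for real stores; `propagate_erasure`, `delete_from`, and the store names are all hypothetical.

```python
from typing import Callable, Dict


def propagate_erasure(data_subject_id: str,
                      handlers: Dict[str, Callable[[str], int]]) -> dict:
    """Run every registered deletion handler and report the result per store."""
    stores = {}
    for name, delete_fn in handlers.items():
        try:
            stores[name] = {"status": "ok", "removed": delete_fn(data_subject_id)}
        except Exception as exc:  # record the failure and continue with other stores
            stores[name] = {"status": "failed", "error": str(exc)}
    complete = all(entry["status"] == "ok" for entry in stores.values())
    return {"data_subject_id": data_subject_id, "complete": complete, "stores": stores}


# Hypothetical in-memory stand-ins for a summary cache and a session store.
summary_cache = {"s1": ["summary A"], "s2": ["summary B"]}
sessions = {"s1": ["chat 1", "chat 2"]}


def delete_from(store: dict) -> Callable[[str], int]:
    """Build a handler that removes a subject's entries and returns how many were removed."""
    def _delete(subject_id: str) -> int:
        return len(store.pop(subject_id, []))
    return _delete


def broken_store(subject_id: str) -> int:
    raise ConnectionError("response cache unreachable")
```

Continuing past a failed store, rather than aborting, means one unreachable cache does not block deletion elsewhere, and the `complete` flag shows at a glance whether the request needs a retry.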

EU AI Act Art.17 Documentation Obligations

EU AI Act Art.17 (Quality Management System) requires high-risk AI providers to document their data governance procedures including "data acquisition, data collection, data analysis, data labelling, data storage, data filtration, data mining, data aggregation, data retention."

For erasure compliance, this means your QMS must include:

Data Retention Schedule: How long do you retain training data after training completes? If you retain it, you must handle erasure requests against the stored data AND against the model weights. If you delete the training data post-training, document this — it changes your erasure obligations (you cannot comply with retraining-based erasure if training data has been deleted).

Erasure Request SLA: How quickly can you retrain or implement unlearning? Document this as part of your GDPR Art.17 compliance procedures and reference it in your QMS under EU AI Act Art.17. GDPR Art.17 requires erasure "without undue delay", and Art.12(3) GDPR sets the outer limit for responding to the request: one month from receipt, extendable by two further months where necessary given the complexity and number of requests.

Audit Trail: Maintain a log of all erasure requests, actions taken, and completion dates. This is evidence of compliance for NCA inspections under the EU AI Act and for supervisory authority audits under GDPR.
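For the erasure request SLA, the response deadline can be computed from the receipt date: one month, or three months in total when the two-month extension is used. The sketch below assumes calendar-month arithmetic that clamps to the last day of shorter months. How deadlines are counted in a specific case (weekends, public holidays) is a legal question this code does not settle.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping to the last day of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def response_deadline(received: date, extended: bool = False) -> date:
    """One month from receipt, or three months in total if the extension applies."""
    return add_months(received, 3 if extended else 1)


print(response_deadline(date(2026, 6, 6)))
print(response_deadline(date(2026, 6, 6), extended=True))
```

Writing the computed deadline into each audit log entry makes overdue requests easy to query.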

GDPR Art.25 (data protection by design and by default) requires controllers to implement appropriate technical and organisational measures that integrate data protection principles into the processing itself. For AI systems:

Implement data subject tagging in training pipelines from the start (not as a retrofit)
Build erasure request workflows into your MLOps pipeline before first training run
Test your deletion procedures before you need them (erasure drills)
Ensure your vector stores have mandatory data_subject_id metadata on all embeddings

Documentation of your Art.25 compliance measures belongs in your DPIA (Art.35 GDPR) and your technical documentation (Art.11 EU AI Act).
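Mandatory data_subject_id metadata can be enforced with a guard that runs before any write to the vector store. This is a minimal sketch: the point layout (a dictionary with a `payload` key) and the `NO_PERSONAL_DATA` sentinel for documents that contain no personal data are assumptions, not features of any particular vector database.

```python
from typing import List

# Hypothetical sentinel for chunks that were assessed as containing no personal data.
NO_PERSONAL_DATA = "__no_personal_data__"


class MissingSubjectIdError(ValueError):
    """Raised when a point would be stored without data subject metadata."""


def require_subject_id(points: List[dict]) -> List[dict]:
    """Reject the whole batch if any point lacks a non-empty data_subject_id."""
    for index, point in enumerate(points):
        subject_id = (point.get("payload") or {}).get("data_subject_id")
        if not isinstance(subject_id, str) or not subject_id.strip():
            raise MissingSubjectIdError(
                f"point at index {index} has no data_subject_id"
            )
    return points
```

Calling `require_subject_id(batch)` immediately before the upsert means an untagged embedding fails loudly at ingestion instead of becoming undeletable later.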

The GPAI Special Case

For GPAI model providers regulated under the EU AI Act, GDPR Art.17 compliance for training data erasure faces an additional challenge: the scale of the training data makes honouring individual erasure requests at the model-weight level practically infeasible.

The European Data Protection Board (EDPB) has not yet issued specific guidance on how GPAI providers should handle Art.17 requests against model weights. The current regulatory position is that:

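GPAI providers must document their training data (technical documentation and a public summary of training content under Art.53 EU AI Act; Art.10 applies only where the model is part of a high-risk system)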
They must have a process for handling erasure requests
If full erasure from weights is technically infeasible, this must be documented with a residual risk assessment

Many GPAI providers currently handle this by committing not to include clearly-identified personal data in future training runs (rather than removing it from existing weights). This is not full Art.17 compliance — it is risk mitigation pending clearer guidance.
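Excluding a data subject from future training runs is often implemented as a suppression list applied when the next corpus is assembled. The sketch below assumes the data subject is identified by email address and that documents mentioning that address should be dropped. `SuppressionList` and `filter_corpus` are hypothetical names, and matching on emails alone will miss other references to the same person.

```python
import hashlib
import hmac
import re
from typing import List

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class SuppressionList:
    """Stores keyed hashes of suppressed emails, so the list holds no raw addresses."""

    def __init__(self, key: bytes):
        self._key = key
        self._tags = set()

    def _tag(self, email: str) -> str:
        normalised = email.strip().lower().encode("utf-8")
        return hmac.new(self._key, normalised, hashlib.sha256).hexdigest()

    def add(self, email: str) -> None:
        self._tags.add(self._tag(email))

    def blocks(self, text: str) -> bool:
        """True if the text mentions any suppressed email address."""
        return any(self._tag(match) in self._tags for match in EMAIL_RE.findall(text))


def filter_corpus(documents: List[str], suppression: SuppressionList) -> List[str]:
    """Keep only documents that mention no suppressed data subject."""
    return [doc for doc in documents if not suppression.blocks(doc)]
```

Keeping a suppression list means retaining a pseudonymised trace of the person who asked to be erased, which is itself a processing decision worth recording in the DPIA.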

If you are building a GPAI system, engage your DPO and legal counsel on this point before August 2026. Coordination between EU AI Act enforcement authorities and national data protection authorities will likely sharpen this requirement.

Developer Compliance Checklist

Training Data Governance:

Inventory all training data sources with personal data flag
Document legal basis for processing personal data in training
Implement data subject tagging or pseudonymisation at ingestion
Apply differential privacy during training (document parameters)
Choose and document unlearning strategy (SISA / full retrain / approximate)
Define erasure request SLA (respond within one month, per GDPR Art.12(3))

RAG Vector Store:

Enforce mandatory data_subject_id metadata on all embeddings
Implement point-level deletion workflow with verification
Build deletion propagation map (vectors + source + cache + session history)
Test erasure workflow in staging before production use
Document deletion procedures in DPIA

Documentation (Art.10 EU AI Act + Art.11 Technical Docs):

Training dataset provenance log with personal data assessment
Erasure request log template (see template above)
QMS erasure procedure under Art.17 EU AI Act quality management system
Art.25 GDPR data protection by design measures documented
DPIA updated with machine unlearning risk assessment

Operational:

DPO briefed on machine unlearning capabilities and limitations
Erasure request intake process defined
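Escalation path documented for technically complex requests (Art.12(3) GDPR: the one-month deadline can be extended by two further months, and the data subject must be informed of the extension within the first month)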
NCA notification procedure if erasure is not possible (residual risk documentation)

What This Means for Deployment on EU Infrastructure

If you deploy your AI system on EU-native infrastructure (servers and services located in, for example, Germany, France, or the Netherlands), you control the full data path: training data storage, model weights, vector stores, and inference logs.

This matters for Art.17 compliance because:

You can implement technical erasure without data crossing CLOUD Act jurisdiction
Your unlearning retraining runs do not export personal data to US-controlled infrastructure
Evidence of erasure (dataset hashes, retrain logs) remains in EU jurisdiction

Hosting on US cloud infrastructure — even with EU data centre contractual commitments — creates a legal complexity: a US government order under 18 U.S.C. § 2713 could compel access to your training data or model weights before you complete an erasure action. This does not eliminate your Art.17 obligations; it adds a jurisdictional risk layer that your DPIA should assess.

Next in the Series

This is Post 3 of 5 in the EU AI Act + GDPR Intersection series. Next:

Post 4: Data Minimisation in AI Training and Inference — Art.5(1)(c) GDPR + AI Training Dataset Compliance
Post 5: AI + GDPR Full Compliance Stack — DPO Role, Accountability, Audit Trail Finale

EU-Native Hosting

Ready to move to EU-sovereign infrastructure?

sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.
