
AI Data Leakage: Risk, Regulation, and What the EU AI Act Requires

Guide · 23 May 2026 · 12 min read · 2,472 words

Six AI data-leakage vectors, the Article 15 and GDPR obligations they trigger, and the controls that close the gap. Updated for the Digital Omnibus.

Sensitive data escaping from an AI system is not a theoretical threat. It happens through routes that are easy to overlook — a model that has memorised fragments of its training set, a staff member pasting a client contract into a public chat tool, an embedding stored in a vector database with no access controls. Under EU law, every one of those routes creates exposure: Article 15 of the EU AI Act (Regulation (EU) 2024/1689) requires high-risk AI systems to be resilient against confidentiality attacks, and GDPR Articles 5 and 32 independently demand that personal data be kept secure.

This article maps the main leakage vectors, the regulatory hooks they trigger, and the practical controls that close the gap.


What AI data leakage actually means

"Data leakage" in AI refers to sensitive, personal, or proprietary data escaping the system in ways that were not intended or authorised. There are six distinct vectors:

Training-data memorisation. Large models can encode fragments of training data verbatim and reproduce them when prompted. If your training set contained medical records, salary data, or internal documents, a well-crafted query can elicit those fragments. The risk is asymmetric: you may not know what the model has memorised until someone extracts it.

Prompt and context leakage. In retrieval-augmented or tool-using systems, the prompt context often contains sensitive material — customer data pulled from a CRM, prior conversation turns, a retrieved contract. Without strict output filtering, those fragments can leak back to other users or into logs.

Embedding and vector-store exposure. Retrieval systems store documents as numerical embeddings. The original text can be approximated from those embeddings under certain conditions, and if the vector store is accessible beyond the intended users, the underlying data is effectively exposed.

Model-inversion and membership-inference attacks. A model-inversion attack uses the model's outputs to reconstruct examples that resemble training data. A membership-inference attack determines whether a specific individual's record was in the training set. Both are real, documented techniques with tooling available publicly. Article 15 addresses exactly this threat class under "confidentiality attacks."

Shadow AI: staff using public tools. The most common leakage route is the simplest. An employee pastes a draft contract, a spreadsheet of client names, or a business plan into a public-facing AI assistant that uses inputs for training. There is no technical exploit; the data leaves the organisation by deliberate (though well-intentioned) action. Article 4's AI literacy obligation — in force since 2 February 2025 — requires organisations to give staff the competence to use AI tools safely. Shadow AI is squarely within scope.

Over-broad tool and integration permissions. When an AI assistant is granted read access to email, calendar, and document stores to answer broad questions, it can inadvertently surface data from one user's context into another's, or write retrieved data to a location with looser access controls. Least-privilege scoping is both a security and a compliance control.
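The vector-store and permission vectors above share one mitigation pattern: check retrieved material against the requesting user's entitlements before it reaches the prompt. A minimal Python sketch, assuming a hypothetical `Chunk` record that carries an `allowed_groups` set (this is an illustration, not any particular vector database's API):

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chunk:
    """A retrieved passage plus the groups allowed to read its source document."""
    text: str
    allowed_groups: frozenset = field(default_factory=frozenset)


def filter_for_user(chunks, user_groups):
    """Keep only chunks the user is entitled to see (deny by default)."""
    user_groups = frozenset(user_groups)
    return [c for c in chunks if c.allowed_groups & user_groups]


retrieved = [
    Chunk("Q3 board minutes", frozenset({"board"})),
    Chunk("Public pricing sheet", frozenset({"board", "sales", "support"})),
    Chunk("Unlabelled legacy file"),  # no groups recorded, so nobody sees it
]

visible = filter_for_user(retrieved, {"sales"})
print([c.text for c in visible])  # ['Public pricing sheet']
```

The deny-by-default choice matters: a document with no recorded entitlements is withheld rather than shown to everyone.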


EU AI Act hooks: which articles apply

Article 15 — Accuracy, robustness, cybersecurity

Article 15 is the primary EU AI Act hook for data leakage in high-risk systems. It requires providers to design high-risk AI systems to withstand attempts to alter their behaviour or outputs by third parties (adversarial robustness) and, critically, attempts to exploit system vulnerabilities to access data or gain unauthorised access (cybersecurity, including confidentiality attacks). Model-inversion and membership-inference attacks fall directly under "confidentiality attacks" within Article 15's cybersecurity requirement.

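The AI Act also ties Article 15 to the Cybersecurity Act (Regulation (EU) 2019/881): under Article 42(2), a high-risk AI system certified under a Cybersecurity Act scheme is presumed to meet Article 15's cybersecurity requirements to the extent the certificate covers them. Relevant ENISA guidance therefore applies alongside the AI Act.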

Article 10 — Data and data governance

Article 10 governs how training, validation, and testing data are handled for high-risk AI systems. Providers must have data governance practices covering: the intended purposes of the data; possible biases; relevant data limitations; appropriate data collection methods; and whether the data is fit for the system's purpose. This is not staff training — that is Article 4 — but the governance of data pipelines themselves.

For data leakage, Article 10 matters in two ways. First, it requires minimisation in the data pipeline: if the training set is stripped of unnecessary personal data before training, the surface area for training-data memorisation shrinks. Second, a well-governed data pipeline with documented lineage makes it possible to audit what was in the training set if a leakage incident occurs.

Article 53 — GPAI providers: training-data summary

If you are a provider of a general-purpose AI (GPAI) model rather than a downstream application, Article 53 applies. Among other baseline obligations, Article 53(1)(d) requires GPAI providers to publish a sufficiently detailed summary of training data used — enabling downstream deployers to understand what data categories the model was trained on and assess leakage risk accordingly. GPAI model obligations under Chapter V have applied since 2 August 2025.

GDPR: Articles 5, 9, 32, 33, and 34

GDPR runs in parallel and must be applied with the AI Act, not instead of it.

GDPR Article 5(1)(f) (integrity and confidentiality) requires personal data to be processed in a manner that ensures appropriate security, including protection against unauthorised or unlawful processing and against accidental loss, destruction, or damage.

GDPR Article 9 sets heightened requirements for special-category data, including health data, biometric data, racial or ethnic origin, and trade union membership. Many AI training datasets touch special-category data. Leakage of such data draws closer scrutiny under GDPR and often coincides with the AI Act's Annex III high-risk categories (biometrics at point 1; life and health insurance risk assessment at point 5(c)).

GDPR Article 32 requires controllers and processors to implement appropriate technical and organisational measures to ensure a level of security appropriate to the risk — encryption, pseudonymisation, ongoing confidentiality, integrity, availability, and resilience of systems.

GDPR Articles 33 and 34 cover breach notification. Article 33 requires the controller to notify the supervisory authority without undue delay and, where feasible, within 72 hours of becoming aware of a personal data breach. Article 34 requires direct notification to the affected individuals where the breach is likely to result in a high risk to them. If a data-leakage incident from an AI system involves personal data, Article 33 will generally apply (and Article 34 where the risk is high), alongside any AI Act serious-incident reporting under Article 73.
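The 72-hour window is simple arithmetic, but it is easy to get wrong across time zones. A small Python sketch, assuming the clock starts when the controller becomes aware of the breach; the function names are illustrative:

```python
from datetime import datetime, timedelta, timezone

NOTIFICATION_WINDOW = timedelta(hours=72)  # GDPR Article 33(1)


def notification_deadline(aware_at):
    """Latest time to notify the supervisory authority after becoming aware."""
    if aware_at.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid ambiguity")
    return aware_at + NOTIFICATION_WINDOW


def hours_remaining(aware_at, now):
    """Hours left before the deadline (negative once it has passed)."""
    return (notification_deadline(aware_at) - now).total_seconds() / 3600


aware = datetime(2026, 3, 2, 9, 30, tzinfo=timezone.utc)
print(notification_deadline(aware).isoformat())  # 2026-03-05T09:30:00+00:00
```

Note that the window counts calendar hours, so weekends and public holidays do not pause it.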


Where high-risk systems face the most exposure

If your AI system sits in an Annex III category — recruitment or performance evaluation (point 4), creditworthiness scoring (point 5(b)), health or life insurance risk assessment (point 5(c)), or biometric identification (point 1) — the data it processes is by definition sensitive, and leakage is a first-order risk, not a secondary one.

A recruitment system trained on historical CV data may encode candidate names, addresses, and employment histories in its weights. A credit-scoring model with API access to applicant transaction data creates an inference-log leakage vector if logs are not encrypted and access-controlled. A healthcare insurer's risk-assessment tool processing medical histories under Annex III point 5(c) is handling special-category GDPR data, so leakage simultaneously triggers GDPR Article 9 and AI Act Article 15.

The deadline for these Annex III stand-alone systems is 2 December 2027, under the Digital Omnibus agreed in May 2026 (the original 2 August 2026 date was deferred). That is more breathing room than was available six months ago — but the data governance work to satisfy Article 10, and the cybersecurity controls to satisfy Article 15, are not quick projects.


Mitigations that satisfy both frameworks

Data classification and minimisation

Before any other control, classify your data: what is personal, what is special-category, what is commercially confidential, what is genuinely public. Then apply minimisation at every stage — collection, preprocessing, training, logging. Data that is not in the system cannot leak from it.

Under Article 10, providers must document that training data is appropriate for the intended purpose; minimisation is the natural output of that discipline.
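One way to make classification and minimisation concrete is an allow-list applied before data enters the training pipeline. A minimal Python sketch, assuming a hypothetical recruitment dataset; the field names and the four sensitivity classes are illustrative, not drawn from the Act:

```python
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = 1
    CONFIDENTIAL = 2
    PERSONAL = 3
    SPECIAL_CATEGORY = 4


# Hypothetical field classification for a recruitment dataset.
FIELD_CLASSES = {
    "job_title": Sensitivity.PUBLIC,
    "years_experience": Sensitivity.PUBLIC,
    "salary_band": Sensitivity.CONFIDENTIAL,
    "full_name": Sensitivity.PERSONAL,
    "home_address": Sensitivity.PERSONAL,
    "health_notes": Sensitivity.SPECIAL_CATEGORY,
}


def minimise(record, allowed):
    """Keep only fields whose class is explicitly allowed; unclassified fields are dropped."""
    return {
        key: value
        for key, value in record.items()
        if FIELD_CLASSES.get(key) in allowed
    }


raw = {
    "job_title": "Analyst",
    "years_experience": 4,
    "full_name": "A. Example",
    "health_notes": "n/a",
    "free_text": "unreviewed",  # never classified, so it is dropped
}
training_row = minimise(raw, {Sensitivity.PUBLIC})
print(training_row)  # {'job_title': 'Analyst', 'years_experience': 4}
```

Dropping unclassified fields by default means a new column cannot reach training until someone has reviewed it.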

DLP controls and shadow AI policy

Deploy data loss prevention (DLP) tooling that detects and blocks transmission of sensitive content to external AI services. Pair it with a written policy covering which tools staff may use for which categories of data, and ensure it is backed by Article 4 AI literacy training. A policy without technical controls is not sufficient, and technical controls without staff understanding generate workarounds.
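At its simplest, the technical side of a DLP control is a pre-send check on outbound text. The Python sketch below is an assumption-laden illustration: the three patterns are deliberately crude, and commercial DLP tooling uses far richer detection (document fingerprinting, classifiers, checksum validation).

```python
import re

# Illustrative patterns only; real DLP products use far richer detection.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){3,7}(?: ?[A-Z0-9]{1,3})?\b"),
    "confidential_marker": re.compile(r"\b(?:strictly )?confidential\b", re.IGNORECASE),
}


def scan(text):
    """Return the sorted names of every pattern that matches the text."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(text))


def may_send_externally(text):
    """Block the outbound request if any sensitive pattern is found."""
    return not scan(text)


print(scan("Contract for jane.doe@example.com, STRICTLY CONFIDENTIAL"))
# ['confidential_marker', 'email']
```

Returning the names of the matched patterns, rather than a bare yes or no, lets the tool tell the user why a paste was blocked, which supports the literacy goal as well as the control.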

Contractual safeguards: no-training clauses and DPAs

If you use a third-party AI service, review its terms before sending any personal or confidential data to it. You need: a clear statement that the provider will not use your inputs for model training; a Data Processing Agreement (DPA) under GDPR Article 28 if personal data is involved; and ideally EU data residency or at least an EU-adequate transfer mechanism if the service is outside the EU/EEA.

These are not optional enhancements. Sending personal data, special-category or not, to a third-party processor without a DPA breaches GDPR Article 28 in its own right, whether or not a leakage incident occurs.

Redaction, anonymisation, and pseudonymisation

Not all use cases require identifiable data. Where your AI system's purpose can be served by anonymised or pseudonymised data, apply those techniques before training and before inference logging. Truly anonymised data falls outside GDPR's scope entirely; pseudonymised data remains personal data under GDPR, but it materially reduces the severity of a breach.
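A common way to pseudonymise identifiers before they reach inference logs is a keyed hash. The Python sketch below uses the standard-library `hmac` module; the key, the field names, and the 16-character truncation are assumptions made for illustration. Because anyone holding the key can recompute the mapping, the output is pseudonymised rather than anonymised, and it remains personal data.

```python
import hashlib
import hmac


def pseudonymise(identifier, secret_key):
    """Replace an identifier with a keyed HMAC-SHA256 token.

    The same input and key always yield the same token, so records can still
    be joined, but the mapping cannot be recomputed without the key.
    """
    normalised = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()[:16]


# Hypothetical key for illustration; in practice load it from a secrets manager
# and store it separately from the pseudonymised dataset.
KEY = b"example-key-do-not-use"

log_entry = {"user": "Jane.Doe@example.com", "action": "score_requested"}
safe_entry = {**log_entry, "user": pseudonymise(log_entry["user"], KEY)}
```

A keyed HMAC is preferable to a plain hash here because email addresses are guessable: without the key, an attacker cannot simply hash candidate addresses and compare. Truncating the digest shortens log lines at the cost of a higher collision chance.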

Access controls and least-privilege scoping

Apply role-based access control to training datasets, model weights, inference APIs, and logs. Audit access quarterly. Where AI integrations have tool permissions (email, document stores, CRM), scope them to the minimum necessary for the task. Every permission granted to an AI assistant beyond its immediate need is an additional leakage vector.
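Least-privilege scoping for tool permissions can be sketched as an explicit map from task to granted scopes, checked on every tool call. The task names and scope strings below are hypothetical and do not correspond to any real assistant platform's API:

```python
# Hypothetical task-to-scope map; scope names are illustrative, not a real API.
TASK_SCOPES = {
    "summarise_meeting": {"calendar:read"},
    "draft_reply": {"email:read", "email:draft"},
}


class ScopeError(PermissionError):
    """Raised when a tool call needs a scope the current task was not granted."""


def authorise(task, required_scope):
    """Allow a tool call only if the task was explicitly granted the scope."""
    granted = TASK_SCOPES.get(task, set())  # unknown tasks get nothing
    if required_scope not in granted:
        raise ScopeError(f"task {task!r} lacks scope {required_scope!r}")
    return True


authorise("draft_reply", "email:read")  # permitted
# authorise("summarise_meeting", "email:read") would raise ScopeError
```

Keeping the map in one reviewable place also gives the quarterly access audit something concrete to inspect.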

Vendor due diligence

Before deploying a third-party AI system in an Annex III context, ask the vendor for their Article 11 technical documentation (or equivalent) and a summary of their Article 15 cybersecurity architecture. A vendor that cannot provide these for a system that will soon be classed as high-risk is giving you a signal about their compliance posture. Under Article 25, if you substantially modify an AI system or change its intended purpose so that it becomes high-risk, you may take on provider obligations yourself, including responsibility for data-leakage controls.


How Confir helps

When you complete an assessment in Confir for a high-risk system, the Data and Technical Robustness area (AITR) records your data-handling controls — including training data governance under Article 10 and cybersecurity measures under Article 15. The assessment asks plain-English questions about data minimisation, access controls, no-training commitments from vendors, and logging practices; the rule-based engine maps your answers to the Act's requirements and flags gaps. The output feeds directly into your Article 11 / Annex IV technical documentation pack.


Frequently Asked Questions

What is AI data leakage under the EU AI Act?

AI data leakage refers to sensitive, personal, or proprietary data escaping a system in unintended ways — through training-data memorisation, prompt or context exposure, embedding leakage, model-inversion or membership-inference attacks, staff pasting data into public tools (shadow AI), or over-broad integration permissions. Under the EU AI Act, high-risk AI systems must be resilient against confidentiality attacks (Article 15); under GDPR, controllers must implement appropriate security measures (Article 32) and notify supervisory authorities within 72 hours of a personal data breach (Article 33).

Which EU AI Act article covers data leakage most directly?

Article 15 is the primary hook: it requires high-risk AI systems to be resilient against adversarial manipulation and confidentiality attacks, a category that covers model-inversion and membership-inference techniques. Article 10 governs the upstream data governance that prevents sensitive data from entering the training pipeline unnecessarily. For GPAI model providers, Article 53(1)(d) requires a training-data summary. GDPR Articles 5(1)(f) and 32 apply in parallel for any system processing personal data.

Does GDPR or the EU AI Act apply when data leaks from an AI system?

Both can apply simultaneously. GDPR Article 32 requires appropriate security; Articles 33/34 require breach notification if the leak involves personal data. The EU AI Act's Article 15 requires providers of high-risk systems to have cybersecurity controls against confidentiality attacks. A leakage incident from a high-risk AI system — say, a credit-scoring model exposing applicant transaction data — would trigger GDPR breach-notification obligations and could constitute a failure of Article 15 cybersecurity requirements under the AI Act. They are not alternatives; they stack.

What is shadow AI and why does it matter for data leakage?

Shadow AI is staff using public AI tools — chat assistants, summarisation tools, image generators — without organisational oversight or policy coverage. The immediate leakage risk is that confidential or personal data typed into a public tool may be used for model training or logged in ways beyond your control. The EU AI Act's Article 4 AI literacy obligation (in force since 2 February 2025) requires organisations to ensure staff have sufficient competence to use AI tools safely. A shadow AI policy backed by technical DLP controls and Article 4 training is the minimum response.

What contractual safeguards reduce AI data leakage risk?

Three are essential when using any third-party AI service for personal or confidential data: a no-training-on-input clause (the provider commits not to use your data to retrain their model); a GDPR-compliant Data Processing Agreement under Article 28 of GDPR; and, for special-category data, a clear legal basis and an appropriate transfer mechanism if the service is outside the EEA. Absent a DPA, handing personal data to the provider breaches GDPR Article 28 in its own right, independent of any leakage incident.

When do Annex III high-risk AI systems need to comply with data-leakage controls?

Under the Digital Omnibus agreed in May 2026, the compliance date for stand-alone Annex III high-risk systems is 2 December 2027 (deferred from the original 2 August 2026). High-risk AI embedded in Annex I regulated products (medical devices, machinery) must comply from 2 August 2028. Article 4 AI literacy obligations — directly relevant to shadow AI — have been in force since 2 February 2025. GPAI baseline obligations under Article 53 have applied since 2 August 2025.

What fines apply to AI data leakage failures under the EU AI Act?

Failures of high-risk obligations, including Article 10 data governance and Article 15 cybersecurity, fall in the Article 99(4) tier: fines of up to €15,000,000 or 3% of total worldwide annual turnover, whichever is higher. For SMEs, including start-ups, Article 99(6) reverses the rule: the cap is whichever of the two amounts is lower, a proportionality protection for smaller companies. GDPR fines (up to €20M or 4% under GDPR Article 83(5) for Article 9 special-category violations) apply separately.
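The higher-of and lower-of rules are easy to confuse, so a worked calculation may help. The Python sketch below computes only the statutory ceiling for the €15,000,000 / 3% tier; the turnover figures are invented, and an actual fine depends on the factors a regulator weighs under Article 99.

```python
FIXED_CEILING_EUR = 15_000_000
TURNOVER_PERCENT = 3


def ai_act_fine_cap(turnover_eur, is_sme):
    """Maximum fine in euros for the EUR 15m / 3% tier.

    Standard rule: the higher of the fixed amount and 3% of turnover.
    SME rule (Article 99(6)): the lower of the two.
    """
    percentage = turnover_eur * TURNOVER_PERCENT // 100
    if is_sme:
        return min(FIXED_CEILING_EUR, percentage)
    return max(FIXED_CEILING_EUR, percentage)


print(ai_act_fine_cap(2_000_000_000, is_sme=False))  # 60000000
print(ai_act_fine_cap(8_000_000, is_sme=True))       # 240000
```

For a large group with €2bn turnover the percentage governs (€60m); for an SME with €8m turnover the cap falls to €240,000.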

