Article 15 EU AI Act: Accuracy, Robustness, and Cybersecurity for High-Risk AI
Article 15 EU AI Act: accuracy metrics, adversarial robustness, feedback loops, and cybersecurity for high-risk AI. Obligation from 2 December 2027.
Article 15 of Regulation (EU) 2024/1689 requires that high-risk AI systems be designed, developed, and tested to achieve appropriate levels of accuracy, robustness, and cybersecurity — and that compliance is maintained throughout the system's lifecycle, not just at market entry. The obligation falls primarily on providers (those placing high-risk AI on the EU market under their own name), and it is backed by a penalty ceiling of €15 million or 3% of global annual turnover, whichever is higher, for non-compliance with high-risk requirements.
High-risk obligations for stand-alone Annex III systems apply from 2 December 2027 under the Digital Omnibus political agreement of May 2026 — pushed back from the original 2 August 2026 date. That extension is breathing room, not a reprieve. The testing, documentation, and monitoring work that Article 15 requires takes longer than most teams expect, and the underlying systems often need architectural changes before they can satisfy the lifecycle framing.
This guide covers the statutory text behind each Article 15 pillar, what providers and deployers must do differently, the specific attack categories regulators expect you to address, and the five compliance gaps that show up most often in practice.
What Article 15 Actually Says
Article 15(1): the lifecycle baseline
Article 15(1) sets the fundamental requirement: high-risk AI systems must be designed and developed to achieve an appropriate level of accuracy, robustness, and cybersecurity, and to perform consistently in those respects throughout their lifecycle. The lifecycle framing is not decorative. It means pre-market testing is necessary but not sufficient — post-deployment monitoring, drift response, and security patching are all part of what Article 15(1) demands.
The appropriate level is calibrated to the intended purpose and risk context. The Act does not prescribe universal floors (no "95% minimum accuracy" appears anywhere in Article 15). A high-risk system screening job applicants faces different accuracy demands than one scoring creditworthiness, and both face different demands than a law-enforcement risk-assessment tool.
Article 15(2): benchmarks and measurement methodologies
Article 15(2) directs the Commission, in cooperation with relevant stakeholders and organisations such as metrology and benchmarking authorities, to encourage development of benchmarks and measurement methodologies for accuracy and robustness of high-risk AI systems. This is an institutional mandate on the Commission, not a direct obligation on providers. In practice, it means harmonised test standards for specific Annex III domains (credit scoring, medical imaging, recruitment screening, biometric identification) are likely to emerge as the Act matures. Where recognised domain benchmarks already exist (for example, the ISO/IEC 19795 series for biometric performance testing, or NIST's face recognition vendor evaluations), providers should use them and document the choice in Annex IV.
Article 15(3) to 15(5): the detailed requirements
Paragraphs 3 to 5 of Article 15 are where the substantive obligations live. They cover robustness, feedback loops, cybersecurity, and the declaration of accuracy:
Robustness (Article 15(4)): the system must be as resilient as possible to errors, faults, and inconsistencies, in particular those arising from interaction with natural persons or other systems. Technical and organisational measures are required, and the Act states that robustness may be achieved through technical redundancy solutions, which may include backup or fail-safe plans.
Feedback loops (Article 15(4)): systems that continue to learn after deployment must be developed to eliminate or reduce, as far as possible, the risk of possibly biased outputs influencing future inputs. Mitigation measures for these loops must be documented.
Cybersecurity (Article 15(5)): the system must be resilient against attempts by unauthorised third parties to alter its use, outputs, or performance by exploiting vulnerabilities. Providers must put in place measures to prevent, detect, respond to, resolve, and control attacks. The Act names specific attack categories: data poisoning (manipulating the training data set), model poisoning (manipulating pre-trained components used in training), adversarial examples or model evasion (inputs designed to make the model err), confidentiality attacks, and model flaws. Solutions must be appropriate to the circumstances and risks.
Accuracy declaration (Article 15(3)): accuracy levels and the relevant accuracy metrics must be declared in the instructions for use accompanying the system under Article 13. This cross-reference is material: deployers rely on those declared metrics to set production monitoring thresholds.
Accuracy: Declaring and Measuring Performance
Accuracy under Article 15(1) means the system correctly performs its declared purpose, reliably and consistently, under normal and reasonably foreseeable conditions. The Act requires accuracy to be appropriate to the intended purpose and declared in the instructions for use. What that means in practice depends entirely on the use case.
For a recruitment screening tool at a 40-person HR-tech company, accuracy might mean no performance gap larger than a stated tolerance (say, five percentage points) between protected demographic groups on a representative test cohort. For a credit-scoring model at a regional lender, it might mean mean absolute error within a defined currency band, reported separately for different applicant segments. For a medical imaging classifier, it might mean sensitivity of 95% and specificity of 92% on CT scans drawn from the demographic spread the system will actually encounter. The provider defines what appropriate means, documents it, and demonstrates it through testing.
Choosing and justifying metrics
Overall accuracy masks failure modes. A fraud-detection model reporting 94% accuracy may still flag a minority demographic at three times the rate of the majority — and that asymmetry is what an auditor will ask about. Annex IV §5 (Performance Metrics) requires providers to list all metrics, thresholds, and the rationale for each, including confidence intervals or error bounds where relevant.
Metric selection depends on the task:
| Task type | Typical metrics | Key watch-out |
|---|---|---|
| Binary classification | Precision, recall, F1, AUC-ROC | Per-subgroup recall gaps; imbalanced classes |
| Multi-class | Macro/weighted F1, confusion matrix | Class imbalance hiding minority-class failures |
| Regression | RMSE, MAE, R² | Tail-error distribution; outlier sensitivity |
| Ranking / recommendation | NDCG, MAP | Position-k performance; demographic rank disparity |
| NLP outputs | Task-specific evaluation; human review | Semantic fidelity; prompt-injection sensitivity |
A single metric without context — "92% accuracy" with no specification of metric type, test set, or subgroup breakdown — will not satisfy a notified body or a market surveillance authority. Annex IV §5 requires the full picture.
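The gap between a headline figure and subgroup performance can be made concrete with a short calculation. The sketch below is illustrative only: the record layout `(group, y_true, y_pred)`, the function names, and the toy data are assumptions made for this example, not anything prescribed by the Act or Annex IV.

```python
from collections import defaultdict

def overall_accuracy(records):
    """Share of records where the prediction matches the true label."""
    return sum(y_true == y_pred for _, y_true, y_pred in records) / len(records)

def recall_by_group(records):
    """Recall (true positives / actual positives) per group.

    records: iterable of (group, y_true, y_pred) with binary labels 0/1.
    """
    tp = defaultdict(int)
    fn = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 1:
            if y_pred == 1:
                tp[group] += 1
            else:
                fn[group] += 1
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn)}

def max_recall_gap(records):
    """Largest difference in recall between any two groups."""
    recalls = recall_by_group(records)
    return max(recalls.values()) - min(recalls.values())

# Hypothetical toy data: group A recall is 9/10, group B recall is 6/10.
RECORDS = (
    [("A", 1, 1)] * 9 + [("A", 1, 0)] * 1
    + [("B", 1, 1)] * 6 + [("B", 1, 0)] * 4
    + [("A", 0, 0)] * 10 + [("B", 0, 0)] * 10
)

if __name__ == "__main__":
    print("overall accuracy:", overall_accuracy(RECORDS))
    print("recall by group:", recall_by_group(RECORDS))
    print("max recall gap:", max_recall_gap(RECORDS))
```

On this toy data the overall accuracy is 87.5%, yet recall differs by about 30 percentage points between the two groups, which is the kind of asymmetry a single headline number hides.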
Representative datasets and stratification
Testing on a curated or homogeneous dataset hides real-world performance. Annex IV §6 (Testing and Validation Procedures) requires documentation of test dataset size, source, representativeness, and results stratified by demographic group, geography, or other relevant subpopulations. A high-risk hiring AI tested only on candidates from one EU member state will not satisfy this requirement when deployed across the EU's linguistic and demographic diversity.
Common mistakes here include: drawing the test set from the same source as the training set (so the test set shares the training data's biases, or even its records, and overstates real-world performance), not stratifying explicitly by protected characteristics, and treating synthetic data as a substitute for real-world representative samples without documenting the limitations.
Accuracy drift is a lifecycle obligation
Accuracy declared at market entry can deteriorate silently. Data drift occurs when the input distribution shifts away from the training data — applicant demographics change, economic conditions alter fraud patterns, patient populations at a clinic expand. Concept drift occurs when the relationship between inputs and the target outcome changes: hiring criteria evolve, regulatory definitions shift, the meaning of a "high-risk applicant" in credit scoring changes with macroeconomic conditions.
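One common way to quantify data drift on a single input feature is the Population Stability Index (PSI), which compares the binned distribution seen in production with the one seen at training time. The sketch below is an assumption-laden illustration: the Act does not name PSI or any other drift statistic, and the function name and pre-binned count inputs are choices made for this example.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over pre-binned counts.

    expected_counts: per-bin counts from the training (reference) data.
    actual_counts:   per-bin counts from recent production data, same bins.
    eps guards against log(0) when a bin is empty.
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_frac = max(e / e_total, eps)
        a_frac = max(a / a_total, eps)
        total += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return total

if __name__ == "__main__":
    # Identical distributions give 0; a shift from 50/50 to 80/20 gives about 0.42.
    print(psi([10, 20, 30], [10, 20, 30]))
    print(psi([50, 50], [80, 20]))
```

A frequently cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cut-offs are industry convention rather than a regulatory threshold; whichever values are used should be justified in the monitoring plan. PSI detects input drift only, so concept drift still has to be caught by tracking accuracy against ground-truth outcomes.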
Both are foreseeable, and Article 15(1)'s lifecycle requirement covers both. Providers must define monitoring thresholds in the post-market monitoring plan (for example, "accuracy must remain above 88% on a rolling 30-day window; if it drops below that, trigger a review within five working days") and document the cadence and responsible role in Annex IV §9; deployers then monitor their own production use against those thresholds. Article 72, which governs post-market monitoring by providers, and Article 26, which sets deployer obligations to follow the instructions for use, together make continuous performance oversight mandatory. A static accuracy claim from 2025 does not satisfy a 2027 audit.
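The example threshold above (accuracy above 88% on a rolling 30-day window) can be sketched as a small monitor. This is a minimal illustration under stated assumptions: the class name, its methods, and the requirement that outcomes arrive in chronological order with ground truth already known are all hypothetical choices, not part of any required tooling.

```python
from collections import deque
from datetime import date, timedelta

class RollingAccuracyMonitor:
    """Tracks accuracy over a sliding window of days.

    Assumes outcomes are recorded in chronological order and that the
    ground-truth label is already known when record() is called.
    """

    def __init__(self, threshold=0.88, window_days=30):
        self.threshold = threshold
        self.window = timedelta(days=window_days)
        self.outcomes = deque()  # (day, was_correct)

    def record(self, day, was_correct):
        self.outcomes.append((day, bool(was_correct)))

    def accuracy(self, today):
        # Discard outcomes that have fallen out of the window.
        while self.outcomes and today - self.outcomes[0][0] >= self.window:
            self.outcomes.popleft()
        if not self.outcomes:
            return None
        return sum(ok for _, ok in self.outcomes) / len(self.outcomes)

    def needs_review(self, today):
        acc = self.accuracy(today)
        return acc is not None and acc < self.threshold

if __name__ == "__main__":
    monitor = RollingAccuracyMonitor()
    start = date(2027, 1, 1)
    for i in range(100):
        monitor.record(start, i < 90)  # 90 of 100 correct
    print(monitor.accuracy(start), monitor.needs_review(start))
```

In practice the review trigger would feed the escalation path and the named responsible role recorded in the monitoring plan; the code only shows the threshold check itself.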
Robustness: Resilience Across the Lifecycle
Robustness is the ability to maintain performance when the system is exposed to errors, faults, inconsistencies, adversarial inputs, or environmental change. Article 15(4) sets out the specific requirements.
Resilience to errors, faults, and inconsistencies
The system must handle degraded inputs without catastrophic failure. Missing fields, corrupted values, out-of-range sensor readings, inconsistencies arising from interaction with natural persons or other AI systems — all of these are foreseeable in production. Technical solutions include input validation (schema enforcement, range checks, type checking before input reaches the model), anomaly detection that routes uncertain cases to human review, and fail-safe mechanisms that halt inference and trigger a safe fallback when inputs are outside the operating envelope.
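The input-validation and fail-safe pattern described above might look like the following sketch. The field names, ranges, and the `model` callable are hypothetical; a real operating envelope would come from the provider's own intended-purpose analysis.

```python
from dataclasses import dataclass

@dataclass
class FieldRule:
    kind: type
    minimum: float
    maximum: float

# Hypothetical operating envelope for a two-field input.
RULES = {
    "age": FieldRule(int, 18, 100),
    "income_eur": FieldRule(float, 0.0, 1_000_000.0),
}

def validate(record):
    """Return a list of validation errors; an empty list means the input is usable."""
    errors = []
    for name, rule in RULES.items():
        if name not in record or record[name] is None:
            errors.append(f"{name}: missing")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, rule.kind):
            errors.append(f"{name}: expected {rule.kind.__name__}")
            continue
        # NaN fails both comparisons, so it is also reported as out of range.
        if not rule.minimum <= value <= rule.maximum:
            errors.append(f"{name}: out of range")
    return errors

def score_or_fallback(record, model):
    """Run the model only on inputs inside the envelope; otherwise fall back."""
    errors = validate(record)
    if errors:
        return {"decision": "human_review", "errors": errors}
    return {"decision": "scored", "score": model(record)}

if __name__ == "__main__":
    dummy_model = lambda record: 0.5  # stand-in for a real model
    print(score_or_fallback({"age": 30, "income_eur": 42_000.0}, dummy_model))
    print(score_or_fallback({"age": 230, "income_eur": None}, dummy_model))
```

The design choice worth noting is that invalid inputs never reach the model: the fallback path is taken before inference, and the validation errors are kept so the human reviewer can see why the case was routed to them.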
Article 15(4) presents technical redundancy, including backup or fail-safe plans, as one way to achieve robustness rather than the only way. Where a fail-safe or redundancy solution is not technically feasible, the provider should document why and compensate through other technical and organisational controls. Infeasibility of redundancy is not a blanket escape from the robustness obligation.
Adversarial inputs and the named attack categories
Article 15(5) covers attempts by unauthorised third parties to alter the system's use, outputs, or performance by exploiting its vulnerabilities. The Act names specific attack categories:
Adversarial examples / model evasion: inputs crafted at inference time to cause misclassification or incorrect decisions, such as pixel-level perturbations on image classifiers, synonym substitutions on text models, and edge-case queries designed to trigger incorrect outputs. The attack does not require access to the model's internals; black-box evasion attacks that rely only on query access are well-documented against commercial systems.
Data poisoning during training: malicious samples injected into training or fine-tuning data to degrade or bias the model's outputs. In a recruitment AI, this means mislabelled examples that shift the model's hiring recommendations; in a credit-scoring model, it means falsified historical records that alter risk weights. Data provenance and integrity controls are the primary defence.
Model poisoning: tampering with pre-trained components used in training, or with the model weights or parameters directly, for example by replacing checkpoint files or modifying deployed model artefacts. Access controls on model storage and immutable deployment artefacts are the key mitigations.
Confidentiality attacks: membership inference (probing whether a specific individual appeared in the training set) and model extraction (using high-volume inference queries to reproduce the model). These target both privacy and intellectual property.
Model flaws: systematic errors in decision logic that can be triggered by specific input patterns — not adversarial in the traditional sense, but exploitable.
Testing must include adversarial perturbation campaigns, not just clean-data benchmarks. For a recruitment AI, this means running prompt-injection attempts against free-text fields and evaluating whether the ranked output shifts. For an image-based medical classifier, it means applying FGSM or PGD perturbations and documenting the degradation curve. The scope and depth of adversarial testing scale with the risk level and deployment context — a system processing thousands of employment decisions daily faces a higher adversarial threat model than one used in a low-volume internal workflow.
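To show what a degradation curve is, the sketch below applies one-step FGSM to a toy logistic-regression model in pure Python and reports accuracy at several perturbation sizes. Everything here is an assumption for illustration: the weights, the four data points, and the epsilon values are invented, and a real campaign would attack the production model, typically with an established adversarial-testing library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One-step FGSM for logistic regression with log loss.

    The gradient of the loss with respect to x_i is (p - y) * w_i, so the
    attack shifts each feature by eps in the direction of that gradient's sign.
    """
    p = predict_proba(w, b, x)
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]

def accuracy_under_attack(w, b, data, eps):
    correct = 0
    for x, y in data:
        x_adv = fgsm(w, b, x, y, eps)
        correct += int((predict_proba(w, b, x_adv) >= 0.5) == bool(y))
    return correct / len(data)

# Hypothetical toy model and labelled points (features, label).
W, B = [2.0, -1.0], 0.0
DATA = [([1.0, 0.0], 1), ([0.2, 0.0], 1), ([-1.0, 0.0], 0), ([-0.1, 0.0], 0)]

def degradation_curve(epsilons=(0.0, 0.05, 0.1, 0.5)):
    return {eps: accuracy_under_attack(W, B, DATA, eps) for eps in epsilons}

if __name__ == "__main__":
    for eps, acc in degradation_curve().items():
        print(f"eps={eps}: accuracy={acc}")
```

The points closest to the decision boundary flip first as epsilon grows, which is why the curve, rather than a single clean-data accuracy figure, is the useful artefact to document.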
Technical redundancy and fail-safe plans
Article 15(4) states that robustness may be achieved through technical redundancy solutions, which may include backup or fail-safe plans. In practice that means backup systems, fail-safe mechanisms that route uncertain cases to a human reviewer, or circuit breakers that halt inference when confidence falls below a defined threshold. These solutions must be documented in Annex IV and linked to the Article 9 risk management plan: robustness controls are not standalone engineering decisions but risk mitigations that must be explicitly mapped to identified risks.
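A confidence-based circuit breaker of the kind mentioned above could be sketched as follows. The class name, the 0.7 confidence floor, and the rule that three consecutive low-confidence cases halt inference are hypothetical parameters chosen for the example; the right values depend on the system's risk assessment.

```python
class ConfidenceCircuitBreaker:
    """Routes low-confidence cases to human review and halts inference
    after a run of consecutive low-confidence predictions."""

    def __init__(self, min_confidence=0.7, max_low_conf_streak=3):
        self.min_confidence = min_confidence
        self.max_low_conf_streak = max_low_conf_streak
        self.streak = 0
        self.tripped = False

    def route(self, confidence):
        if self.tripped:
            return "halted"
        if confidence < self.min_confidence:
            self.streak += 1
            if self.streak >= self.max_low_conf_streak:
                # Stay halted until an authorised person calls reset().
                self.tripped = True
                return "halted"
            return "human_review"
        self.streak = 0
        return "automated"

    def reset(self):
        """Manual reset after investigation."""
        self.streak = 0
        self.tripped = False

if __name__ == "__main__":
    breaker = ConfidenceCircuitBreaker()
    for c in (0.9, 0.5, 0.5, 0.5, 0.9):
        print(c, breaker.route(c))
```

Requiring a manual reset is a deliberate choice in this sketch: once the breaker trips, a person has to decide that the system is safe to resume, which gives the documentation a clear point to link to the risk management plan.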
Feedback loops in systems that continue to learn
Any high-risk system that continues to learn after deployment (a model that retrains on production outcomes, or a recommendation engine that updates weights based on user interactions) must be developed to eliminate or reduce, as far as possible, the risk that biased outputs influence future inputs. This is the feedback-loop provision of Article 15(4).
The risk is concrete. A hiring model that retrains on historical hiring decisions will amplify whatever demographic patterns exist in those decisions. A credit-scoring model that updates on repayment outcomes will reinforce lending patterns that already reflect historic discrimination. The feedback loop is not hypothetical; it is the documented failure mode of several deployed ML systems.
Mitigation measures include: periodic bias audits run before any retraining cycle, input-weighting controls that limit the influence of model-generated outputs on the retraining dataset, hold-out validation sets drawn independently of recent production outputs, and human-review gates before retrained models go live. These measures — and the evidence that they are operating as intended — belong in Annex IV §6 and §9.
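Two of the mitigations listed above, limiting the influence of model-generated outcomes on the retraining set and gating promotion on a bias audit, can be sketched in a few lines. The `model_influenced` flag, the 20% cap, and the five-percentage-point selection-rate tolerance are assumptions made for this example, not values drawn from the Act.

```python
def cap_model_influenced(samples, max_share_pct=20):
    """Keep all independent samples and only as many model-influenced ones
    as keeps their share of the retraining set at or below max_share_pct.

    samples: list of dicts with a boolean 'model_influenced' key.
    max_share_pct must be below 100.
    """
    independent = [s for s in samples if not s["model_influenced"]]
    influenced = [s for s in samples if s["model_influenced"]]
    # k influenced samples give a share of k / (n + k); requiring this to be
    # at most max_share_pct / 100 gives k <= max_share_pct * n / (100 - max_share_pct).
    limit = (max_share_pct * len(independent)) // (100 - max_share_pct)
    return independent + influenced[:limit]

def selection_rate_gap(rates):
    """Largest difference in positive-outcome rate between groups."""
    return max(rates.values()) - min(rates.values())

def may_promote(rates, max_gap=0.05):
    """Bias-audit gate: block promotion if the selection-rate gap is too wide."""
    return selection_rate_gap(rates) <= max_gap

if __name__ == "__main__":
    pool = [{"model_influenced": False}] * 80 + [{"model_influenced": True}] * 50
    kept = cap_model_influenced(pool)
    print(len(kept), sum(s["model_influenced"] for s in kept))
    print(may_promote({"group_a": 0.30, "group_b": 0.27}))
    print(may_promote({"group_a": 0.30, "group_b": 0.20}))
```

A passing automated gate would normally precede, not replace, the human-review gate, and the audit result for each retraining cycle is the evidence that the mitigation is operating.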
Cybersecurity: Protecting Against AI-Specific Exploitation
Cybersecurity under Article 15(5) is not general IT security applied to an AI system. It is specifically about resilience against attempts by unauthorised parties to alter the system's use, outputs, or performance by exploiting vulnerabilities in the AI system itself. The attack surface is different from conventional software, and the controls must address it explicitly.
The AI-specific attack surface
| Attack type | What it targets | Example |
|---|---|---|
| Data poisoning | Training pipeline | Mislabelled samples injected into fine-tuning data |
| Model poisoning | Model weights / parameters | Replacing checkpoint files post-training |
| Adversarial examples / evasion | Inference inputs | Pixel perturbations causing misclassification |
| Confidentiality / membership inference | Training data | Queries probing whether a specific individual was in the training set |
| Model extraction | Proprietary model | High-volume inference to reproduce the model externally |
| Supply-chain compromise | Upstream dependencies | Pre-trained model or third-party library contains a backdoor |
Core technical measures
Providers must document controls across four domains in Annex IV §8:
- Encryption: data at rest (model weights, training datasets, inference logs) and in transit; key management procedures and key rotation cadence.
- Access control: multi-factor authentication on all systems touching model weights and training data; role-based access with least-privilege enforcement; separate credentials for training, evaluation, and inference environments.
- Audit logging: immutable logs of all model access, configuration changes, and data queries; defined retention period; access to the logs restricted to authorised roles.
- Vulnerability management: penetration testing on a defined cadence; patch timelines for critical vulnerabilities; dependency scanning for upstream libraries and pre-trained models used as components.
An incident response plan must also appear in Annex IV §8: who is notified when a security incident affecting the AI system is detected, within what timelines, who makes the decision to suspend the system, and how the provider notifies deployers and (where relevant) market surveillance authorities.
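For the audit-logging control, one way to make tampering detectable is to chain log entries by hash, so that altering any earlier entry invalidates everything after it. The sketch below is a simplified illustration with hypothetical field names; it makes a log tamper-evident, not immutable, and a production system would also need write-once storage and restricted access.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _digest(actor, action, prev):
    body = {"actor": actor, "action": action, "prev": prev}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

class AuditLog:
    """Append-only log where each entry commits to the previous entry's hash."""

    def __init__(self):
        self.entries = []

    def append(self, actor, action):
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        self.entries.append({
            "actor": actor,
            "action": action,
            "prev": prev,
            "hash": _digest(actor, action, prev),
        })

    def verify(self):
        """Recompute every hash and check the chain links."""
        prev = GENESIS
        for entry in self.entries:
            expected = _digest(entry["actor"], entry["action"], entry["prev"])
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True

if __name__ == "__main__":
    log = AuditLog()
    log.append("ml-engineer", "loaded model weights v3")
    log.append("ops", "changed inference threshold")
    print(log.verify())
    log.entries[0]["action"] = "nothing to see here"
    print(log.verify())
```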
Measures appropriate to circumstances and risks
Article 15(5) qualifies the cybersecurity requirement: technical solutions must be appropriate to the relevant circumstances and the risks. A small HR-tech company shipping a CV-screening tool faces a different threat model than a financial institution running a credit-scoring model for retail lending. The former needs rigorous access controls and data provenance documentation; the latter may face active adversarial interest and needs penetration testing by a qualified third party, documented model extraction defences, and a formal incident response procedure.
For companies without in-house security expertise, a qualified third-party penetration test typically costs €3,000–€8,000 for a high-risk SaaS application. For medical or safety-critical systems requiring notified body involvement under Article 43, the conformity assessment itself covers cybersecurity review.
Relationship to NIS2 and ISO 27001
Article 15 cybersecurity requirements deliberately overlap with the NIS2 Directive (Directive (EU) 2022/2555), which applies to operators of essential services and digital infrastructure — several Annex III high-risk categories fall within NIS2 scope, including critical infrastructure (category 2) and essential services including health and banking (category 5). If your organisation is already NIS2-obligated, your existing security management framework addresses many of the Annex IV §8 requirements. ISO 27001 certification provides a strong baseline.
Neither NIS2 nor ISO 27001, however, covers AI-specific attack vectors: data poisoning, model poisoning, model extraction, adversarial examples, membership inference. Those gaps must be addressed explicitly in your Article 15 documentation. You cannot point to an ISO 27001 certificate and declare Article 15 cybersecurity satisfied.
Integration with Article 9 risk management
Cybersecurity controls must be integrated into the Article 9 risk management system. The risk management plan must identify AI-specific cybersecurity risks, evaluate their likelihood and severity, document the controls adopted to mitigate them, and define monitoring and re-evaluation triggers — for example, a new class of adversarial attack published in the research literature, or a post-market incident report from another deployer of the same system. Article 9 requires this process to be iterative and documented throughout the lifecycle, not completed once at launch.
Provider and Deployer Responsibilities
Article 15 places the design and pre-market testing obligations on providers. Article 26 requires deployers to use high-risk systems in accordance with the instructions for use — which includes monitoring against the accuracy levels and metrics that providers declare under Article 15 and Article 13.
What providers must do before market entry
- Design and validate accuracy, robustness, and cybersecurity appropriate to the intended purpose.
- Declare accuracy levels and the relevant metrics in the instructions for use (Articles 13 and 15(3)).
- Complete Annex IV §5–8: performance metrics, testing and validation procedures, post-market monitoring plan, and security measures.
- For systems that continue to learn after deployment: document feedback-loop mitigation measures in Annex IV §6 and §9.
- Issue the Article 47 Declaration of Conformity (Annex V) before placing the system on the market.
- Retain technical documentation for at least ten years from market entry (Article 18).
What deployers must do in production
- Verify before deployment that the system's declared performance properties hold in their specific operational context, data environment, and user population. A system accurate on the provider's test set may drift when deployed to a different geographic or demographic population.
- Establish monitoring KPIs with defined thresholds and a named responsible role.
- Detect and investigate accuracy drift or security incidents.
- Report serious incidents to the relevant market surveillance authority under Article 73, within 15 days where harm has occurred or is likely.
- Retain monitoring records for at least three years under Article 72.
The modification trap: when deployers become providers
A deployer that retrains the model on new data, changes input preprocessing, or alters the decision boundary in a way that affects performance or intended purpose has made a substantial modification under Article 3(23). That change triggers full provider obligations: new technical documentation, fresh robustness and cybersecurity testing, a new conformity assessment under Article 43 where required, and a new Declaration of Conformity. This is the most common Article 15 compliance trap in practice.
Teams retrain a high-risk model to improve performance on their specific applicant population, then assume the original provider's Declaration of Conformity still covers them. It does not. The modification triggers provider status for the modified version under Article 25, with all the Article 16 obligations that follow.
Responsibility matrix
| Obligation | Provider | Deployer |
|---|---|---|
| Design accuracy, robustness, cybersecurity | Required | — |
| Declare accuracy metrics in instructions for use | Required | — |
| Test adversarial robustness | Required | Verify in own context |
| Document feedback-loop mitigations (continuous-learning systems) | Required | — |
| Produce Annex IV technical documentation | Required | — |
| Issue Declaration of Conformity (Annex V) | Required | — |
| Monitor accuracy in production | Post-market plan (Art 72) | Required (Art 26) |
| Detect and respond to drift | Required | Required |
| Report serious incidents (Article 73) | Required | Required |
| Retain records for at least 3 years (Article 72) | Required | Required |
| Assume provider obligations if substantially modified | — | Required (Art 25) |
Worked example: a 35-person HR-tech company
A small HR-tech firm builds a CV-screening tool that ranks applicants by role-fit score for its clients' hiring workflows. The tool is high-risk under Annex III category 4 (employment; workers management; access to self-employment).
As provider, the firm must: select accuracy metrics that capture per-demographic performance (balanced F1 across gender, age, and ethnicity groups; no subgroup gap above five percentage points); run adversarial tests including prompt-injection attempts on free-text CV fields; document the data provenance and integrity controls over its training pipeline (data-poisoning defence); specify the fail-safe behaviour when confidence is below threshold (route to human reviewer); declare all of this in the instructions for use; and complete the Annex IV technical documentation and Declaration of Conformity before market entry.
Its deployer clients — HR departments that integrate the tool — must monitor accuracy in their own applicant pool, not only against the provider's benchmarks. If a deployer observes that accuracy on its specific population is consistently five percentage points lower than declared, it must investigate and notify the provider. If the deployer retrains the model on its own historical hiring data to try to close that gap, it becomes a provider for that retrained version and must produce its own technical documentation and Declaration of Conformity.
Building the Article 15 Test Record: What Goes in Annex IV
Article 15 does not operate in isolation — its output is a set of evidence artefacts that must appear in the Article 11 technical documentation (structured by Annex IV) and that underpin the Article 47 Declaration of Conformity. Getting the structure right matters: a notified body or market surveillance authority will check Annex IV §5–8 directly against the Article 15 requirements.
Annex IV §5: Performance Metrics
This section must specify:
- Every accuracy metric used, with the rationale for choosing it over alternatives. A binary classification model might report precision, recall, and F1 — but the documentation must explain why F1 was chosen as the primary metric (class balance, cost of false negatives vs. false positives) rather than accuracy alone.
- Target thresholds for each metric, with confidence intervals where feasible. "Recall ≥ 0.90 at 95% confidence on a test set of 10,000 samples" is compliant. "Recall is generally high" is not.
- Per-subgroup performance, broken down by at least the protected characteristics relevant to the deployment context. For an employment AI, that means gender, age group, and any ethnicity breakdowns possible given the test data. For a credit-scoring model, it means income bracket, geography, and any demographic proxies used in the model.
- Robustness metrics: the degradation curve under adversarial perturbation (e.g., F1 at perturbation intensity 0.01, 0.05, 0.1), performance on out-of-distribution samples, and, for systems that continue to learn, the bias metric before and after the first retraining cycle.
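A threshold such as "recall ≥ 0.90 at 95% confidence" is only meaningful once the interval is computed. The sketch below uses the Wilson score interval, which is one standard choice for a binomial proportion; the choice of method and the example counts are assumptions, not something the Act or Annex IV specifies. Note that for recall the sample size is the number of actual positives in the test set, not the total number of test samples.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes: e.g. true positives; n: e.g. actual positives (for recall).
    z = 1.96 corresponds to a two-sided 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def meets_threshold(successes, n, threshold=0.90):
    """True only if the lower confidence bound clears the threshold."""
    lower, _ = wilson_interval(successes, n)
    return lower >= threshold

if __name__ == "__main__":
    # Same 92% point estimate, different numbers of actual positives.
    print(wilson_interval(920, 1000), meets_threshold(920, 1000))
    print(wilson_interval(92, 100), meets_threshold(92, 100))
```

With 920 of 1,000 positives detected, the lower bound sits just above 0.90; with 92 of 100, the same point estimate has a lower bound of roughly 0.85 and would not support the claim. This is why the subgroup sample sizes matter as much as the headline number.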
Annex IV §6: Testing and Validation Procedures
This section documents how the §5 numbers were generated. It must include:
- Test dataset description: size, source, collection period, any known limitations. A test set of fewer than 500 samples for a system making individual employment decisions will draw scrutiny. Sample sizes of 5,000+ for binary classification tasks, with at least 200 samples in each minority subgroup, are defensible.
- Methodology for adversarial testing: which attack methods were applied (e.g., FGSM with ε = 0.03 for image perturbations; template-based prompt injection for free-text inputs), the pass/fail criteria, and the results. For a recruitment AI with no image component, the adversarial testing scope might be limited to prompt injection and input-schema fuzzing — that is acceptable if the scope limitation is documented and justified.
- For systems trained on personal data: documentation of the data-poisoning controls in the training pipeline — how the training data was validated, whether data provenance was checked, and what access controls protected the training set.
- If synthetic data was used to augment the test set: the generation methodology, its known distributional differences from real-world data, and what this means for interpreting the reported metrics.
Annex IV §7: Post-Market Monitoring Plan
Post-market monitoring is specified in Article 72, but the plan itself is documented here. It must identify:
- Which accuracy and robustness KPIs will be tracked in production, with defined thresholds that trigger a review.
- Monitoring cadence — weekly for high-throughput employment or credit-scoring systems; monthly for lower-volume deployments — and the tooling or process used to collect production metrics.
- The responsible role and escalation path. Not "the operations team" — a named role with a named backup.
- The retraining trigger: what combination of drift magnitude and time window triggers a retraining proposal, who approves it, and what validation gate must be passed before the retrained model replaces the live version.
- The Article 73 serious-incident reporting procedure: what counts as a serious incident (unexpected harm, significant accuracy failure affecting an individual's rights), who files the report, and the 15-day notification deadline.
Annex IV §8: Cybersecurity Measures
This section is where the AI-specific threat model lives. It must document:
- The threat model: each attack vector assessed, its likelihood given the deployment context, the potential impact, and the control measures adopted. A template row: "Data poisoning — Medium likelihood (training data sourced from third-party providers); High impact (model bias affecting hiring decisions); Control: third-party data validation checklist + SHA-256 hash verification of training dataset before each training run."
- Technical controls in place, with enough specificity for an auditor to verify implementation: encryption standards and key management, access control architecture with role definitions, audit log scope and retention, dependency scanning tools and cadence.
- Penetration testing: who conducted it (internal or third-party), the date, the scope (model API, training infrastructure, inference endpoint), the findings, and the remediation status of any findings.
- The incident response plan: detection (what monitoring alerts on a potential attack), containment (who is authorised to suspend inference), investigation (who conducts the post-incident review), notification (which authorities and deployers are told, within what timelines), and recovery (what evidence is preserved before systems are restored).
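The SHA-256 verification named in the template threat-model row can be sketched with the standard library. The function names and the idea of storing the expected digest alongside the dataset version record are assumptions for illustration; the point is simply that a training run refuses to start if the dataset no longer matches the digest recorded when it was validated.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_dataset(path, expected_hex):
    """Return True only if the file's digest matches the recorded one."""
    return sha256_of_file(path) == expected_hex.lower()

def guarded_training_run(path, expected_hex, train):
    """Call the (hypothetical) train callable only after the integrity check passes."""
    if not verify_dataset(path, expected_hex):
        raise RuntimeError("training dataset digest mismatch; run aborted")
    return train(path)
```

A hash check detects modification after the digest was recorded; it says nothing about whether the data was clean when first validated, so it complements rather than replaces the provenance checklist.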
Keeping the documentation current
Annex IV is not a one-time artefact. Every time the system is materially changed (retrained, redeployed to a new environment, updated with a new model version) the documentation must be reviewed. Sections that change must be updated and version-stamped. The Declaration of Conformity references the Annex IV version; if Annex IV changes substantially without re-issuance of the DoC, the DoC becomes misleading. Under Article 3(23), a change that was not foreseen in the initial conformity assessment and that affects compliance with the high-risk requirements or modifies the intended purpose is a substantial modification, which means a new conformity assessment and a new DoC, not just a documentation update. For systems that continue to learn, Article 43(4) carves out changes that the provider pre-determined at the initial conformity assessment and recorded in the technical documentation; those do not count as substantial modifications.
For systems that continue to learn, this means the Annex IV update cadence should be tied to the retraining cycle. Before each retraining run: check whether feedback-loop mitigations are still operating correctly. After each retraining run: update the §5 and §6 performance metrics with the new model's test results, and review whether any §8 cybersecurity controls need adjusting for the updated training data.
Five Common Article 15 Compliance Gaps
Gap 1: Testing on clean data only
Providers validate accuracy on a high-quality held-out test set and declare compliance. Real-world deployment introduces minority populations, corrupted inputs, out-of-distribution examples, and adversarial queries that never appeared in the benchmark. Article 15(4) requires resilience to errors, faults, and inconsistencies. A model achieving 95% accuracy on a curated benchmark may drop to 30% accuracy on edge-case inputs the benchmark excluded.
The fix: expand the test matrix to cover minority subgroups, out-of-distribution samples, at least one category of adversarial perturbation relevant to the deployment context, and any system-interaction scenarios (e.g., inputs generated by another AI system upstream). Document all of this in Annex IV §6, including sample sizes, pass/fail criteria, and limitations.
Gap 2: Static accuracy declaration with no post-deployment monitoring
Providers report a single metric at market entry and assume ongoing compliance. Article 15(1) requires performance throughout the lifecycle; Article 72 mandates a post-market monitoring system with active data collection. A static accuracy claim does not satisfy either. The fix: establish a monitoring dashboard with defined KPIs — accuracy, drift indicators, adversarial-attack detection rate, security anomalies — a review cadence, a named responsible role, and a documented trigger for escalation. Record this in Annex IV §9.
Gap 3: IT security review with no AI-specific threat analysis
Providers hand robustness and security testing to the IT or InfoSec team. Infrastructure controls — firewalls, encryption, access management — get addressed. AI-specific vectors — data poisoning, model poisoning, adversarial examples, membership inference, model extraction — do not appear in the standard IT risk register and get missed. Annex IV §8 is incomplete without them. The fix: run a joint risk assessment across IT and the AI development team; map each AI-specific threat to likelihood, impact, and mitigation; document the full threat model in Annex IV §8.
Gap 4: Post-market monitoring plan with no named owner
Providers document a monitoring cadence in Annex IV §9 but assign it to "the compliance team" rather than a named role with explicit responsibilities. When drift occurs or a serious incident arises, no one owns the escalation. Article 73 requires providers to report a serious incident no later than 15 days after becoming aware of it, with shorter windows for the most severe cases. Without a named owner and an escalation procedure, that deadline is very hard to meet. The fix: assign a named role, such as an AI Compliance Officer or equivalent, with documented responsibilities for KPI review, incident triage, and Article 73 reporting.
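An escalation procedure usually starts with knowing how much time is left. The sketch below tracks a reporting window from the date the provider became aware of an incident. It assumes simple calendar-day counting and uses the 15-day figure cited above as a default; Article 73 sets shorter windows for some incident types, which this sketch leaves to the caller through the `window_days` parameter. The function names are hypothetical, and how days are counted legally is not modelled here.

```python
from datetime import date, timedelta


def notification_deadline(awareness_date: date, window_days: int = 15) -> date:
    """Last calendar day for reporting, counted from the awareness date."""
    return awareness_date + timedelta(days=window_days)


def days_remaining(awareness_date: date, today: date, window_days: int = 15) -> int:
    """Days left before the deadline; negative once it has passed."""
    return (notification_deadline(awareness_date, window_days) - today).days


# Example dates chosen for illustration only.
aware = date(2028, 3, 1)
deadline = notification_deadline(aware)
left = days_remaining(aware, today=date(2028, 3, 10))
print(f"Report by {deadline.isoformat()}; {left} days remaining")
```

A counter like this only helps if a named owner sees it, so it belongs in the same workflow as the role assignment described above.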
Gap 5: Declaration of Conformity not re-issued after retraining
Providers issue a Declaration of Conformity at market entry, then retrain the model or change the deployment environment, but do not update Annex IV or re-issue the DoC. The original DoC becomes stale and misleading. A substantial modification under Article 3(23) triggers a new conformity assessment under Article 43(4) and a new Declaration of Conformity with a new version number and date. The fix: implement version control on the Annex IV file; establish a re-issuance trigger for any change that materially affects accuracy, robustness, cybersecurity, or intended purpose; archive prior versions in the audit log.
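A re-issuance trigger can be approximated by fingerprinting the fields that were in force when the Declaration of Conformity was issued and comparing each candidate release against them. In the sketch below, the list of material fields is an assumption for illustration. The check only flags a release for review; whether a change is a substantial modification remains a legal judgment.

```python
import hashlib
import json

# Hypothetical set of fields treated as material to the declaration.
MATERIAL_FIELDS = (
    "model_weights_sha256",
    "training_data_version",
    "preprocessing_version",
    "intended_purpose",
    "decision_threshold",
)


def fingerprint(release: dict) -> str:
    """Stable hash over the material fields only."""
    material = {k: release.get(k) for k in MATERIAL_FIELDS}
    payload = json.dumps(material, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def changed_fields(declared: dict, candidate: dict) -> list:
    """Material fields that differ from the declared release."""
    return [k for k in MATERIAL_FIELDS if declared.get(k) != candidate.get(k)]


# Example releases; all values are placeholders.
declared = {
    "model_weights_sha256": "aaa111",
    "training_data_version": "2027-09",
    "preprocessing_version": "1.4",
    "intended_purpose": "credit scoring",
    "decision_threshold": 0.5,
    "ui_theme": "light",  # not material, ignored by the fingerprint
}
retrained = dict(declared, model_weights_sha256="bbb222", training_data_version="2028-01")
restyled = dict(declared, ui_theme="dark")

for name, release in (("retrained", retrained), ("restyled", restyled)):
    diff = changed_fields(declared, release)
    if diff:
        print(f"{name}: review for re-issuance, changed {diff}")
    else:
        print(f"{name}: no material change detected")
```

Storing the fingerprint alongside each archived Annex IV version gives the audit log a quick way to show which release a given declaration covered.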
How Confir Supports Article 15 Compliance
Article 15 sits within Confir's AITR area (Data & Technical Robustness, covering Articles 10, 11, and 15). The AITR-02 control — Accuracy, Robustness & Cybersecurity — walks providers and deployers through a structured assessment that addresses each Article 15 requirement: accuracy metric selection and declaration, the adversarial threat inventory covering the Act's named attack categories, the feedback-loop mitigation record for systems that continue to learn after deployment, and the cybersecurity controls checklist mapped to Annex IV §8.
Completing the AITR-02 assessment generates findings that populate the Annex IV §5–8 sections of the Conformity Package — the same artifact underlying the Article 47 Declaration of Conformity. The Compliance Health Score flags incomplete robustness or cybersecurity assessments as open findings in the risk register. Confir's classification and scoping engine is rule-based and deterministic: the same intake produces the same Article 15 finding every time, which matters when an auditor asks you to explain the compliance logic behind a control finding.
For companies that are deployers rather than providers, the AITR-02 assessment scopes to the monitoring and verification obligations — production KPI thresholds, incident-response role assignment, and the substantial-modification trigger that converts a deployer into a provider under Article 25.
Frequently Asked Questions
What does Article 15 require for accuracy, specifically?
Article 15(1) requires that high-risk AI systems achieve an appropriate level of accuracy for their intended purpose and perform consistently throughout their lifecycle. Article 15(3) requires that accuracy levels and the relevant metrics be declared in the instructions for use accompanying the system under Article 13. The Act does not set a universal percentage threshold — the appropriate level depends on the use case, risk, and population the system serves. Providers must define the metrics, justify them in Annex IV §5, and demonstrate through testing that the system meets them on representative data, including stratified results by relevant subgroups.
Does Article 15 cover feedback loops in AI systems that continue to learn?
Yes. Article 15(4) specifically requires that systems continuing to learn after deployment be developed to eliminate or reduce, as far as possible, the risk that biased outputs influence future inputs (the feedback-loop problem), and that any such loops are addressed with appropriate mitigation measures. The Act does not list those measures. In practice, providers should document controls such as periodic bias audits, input-weighting controls, hold-out validation sets drawn before any retraining cycle, and human-review gates before retrained models go live. These measures and the evidence supporting them belong in Annex IV §6 and §9.
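Two of the measures above, a hold-out set frozen before retraining and a human-review gate, can be combined into a single release check. The sketch below assumes per-subgroup accuracy has already been measured for both models on the same frozen hold-out set; the `release_gate` name, the subgroup labels, and the 0.02 tolerance are hypothetical choices for illustration.

```python
def release_gate(baseline, candidate, max_drop=0.02, human_approved=False):
    """Decide whether a retrained model may go live.

    baseline, candidate: dict mapping subgroup name to accuracy on the
    hold-out set that was frozen before the retraining cycle.
    Returns (allowed, reasons); an empty reasons list means allowed.
    """
    reasons = []
    for group, base_acc in baseline.items():
        cand_acc = candidate.get(group)
        if cand_acc is None:
            reasons.append(f"{group}: no result on the frozen hold-out set")
        elif base_acc - cand_acc > max_drop:
            reasons.append(f"{group}: accuracy fell by {base_acc - cand_acc:.3f}")
    if not human_approved:
        reasons.append("human review not recorded")
    return (not reasons, reasons)


# Placeholder figures, not real measurements.
baseline = {"group_a": 0.93, "group_b": 0.91}
candidate = {"group_a": 0.94, "group_b": 0.85}

allowed, reasons = release_gate(baseline, candidate, human_approved=True)
print("release allowed" if allowed else f"release blocked: {reasons}")
```

Checking each subgroup rather than the overall average matters here: a feedback loop can improve aggregate accuracy while degrading results for the group whose data the loop underweights.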
What is Article 15(2) about — does it impose obligations on providers?
Article 15(2) directs the Commission, in cooperation with relevant stakeholders and organisations such as metrology and benchmarking authorities, to encourage the development of benchmarks and measurement methodologies for the accuracy and robustness of high-risk AI systems. It is an institutional mandate on the Commission, not a direct obligation on providers. In practice, it signals that harmonised test standards for specific Annex III domains are coming. Where recognised domain benchmarks already exist, such as the ISO/IEC 19795 series for biometric performance testing or NIST's face recognition evaluations (FRVT, now FRTE), providers should use them and document the choice.
Who is responsible for Article 15 compliance — provider or deployer?
Providers carry the primary obligations: designing, testing, and documenting accuracy, robustness, and cybersecurity before market entry; completing Annex IV §5–8; issuing the Declaration of Conformity. Deployers must monitor performance in their specific deployment context, verify that declared accuracy properties hold in their environment, detect drift and security incidents, and inform the provider and the relevant market surveillance authorities of serious incidents under Article 26(5). A deployer that substantially modifies the system (retraining on new data, changing input preprocessing, altering the decision boundary) becomes a provider for that version under Article 3(23) and Article 25, and must re-test and re-issue all documentation.
What AI-specific cybersecurity threats must providers address under Article 15?
Article 15(5) names: data poisoning (manipulating the training data set), model poisoning (manipulating pre-trained components used in training), adversarial examples or model evasion (inputs designed to make the model err), confidentiality attacks (for example membership inference and model extraction), and model flaws. Controls such as encryption of model artefacts and training data, access control on the training pipeline, audit logging, vulnerability management, and incident response must be documented in Annex IV §8 and integrated into the Article 9 risk management system.
How does Article 15 relate to NIS2 and ISO 27001?
The frameworks overlap, but only partly. NIS2 applies to essential and important entities in sectors such as energy, transport, health, and digital infrastructure, which includes several Annex III high-risk deployment contexts. ISO 27001 provides a general information security management baseline. Both address infrastructure and organisational security, but neither specifically covers AI attack vectors: data poisoning, model poisoning, model extraction, adversarial examples, membership inference. Providers must supplement existing security documentation with an explicit AI threat analysis in Annex IV §8. A NIS2 compliance attestation or an ISO 27001 certificate does not substitute for this.
When does Article 15 apply — what is the deadline?
Article 15 is part of the high-risk requirements in Chapter III Section 2 of Regulation (EU) 2024/1689. For stand-alone high-risk AI systems listed in Annex III — recruitment tools, credit-scoring models, biometric systems, and others — obligations apply from 2 December 2027 under the Digital Omnibus political agreement of May 2026, which deferred the original 2 August 2026 date. For high-risk AI embedded as safety components of products covered by EU product law under Annex I, the date is 2 August 2028.