
How to Implement Data Governance for EU AI Act Compliance

Article 10 of the EU AI Act places explicit requirements on the quality and governance of data used to train, validate, and test high-risk AI systems. These requirements sit alongside, and must be reconciled with, GDPR obligations. This guide explains what Article 10 requires in practice and how to implement the necessary data governance controls.

EU AI Act Reference

Article 10 applies to providers of high-risk AI systems. It requires that training, validation, and testing datasets be subject to data governance practices covering: the design choices regarding data collection; data preparation and processing operations; the formulation of relevant assumptions; an assessment of the availability, quantity, and suitability of data; and an examination for possible biases. Datasets must be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete in view of the intended purpose.

Article 10 Data Quality Criteria

Article 10 sets out specific quality characteristics that training, validation, and test data must meet. These apply to high-risk AI providers and must be demonstrable through documentation:

Relevant

Data must be appropriate for the intended purpose of the AI system. The inputs must relate meaningfully to what the model is designed to predict or decide.

Representative

Training data must adequately represent the population or scenario the AI system will encounter in deployment, including relevant demographic groups.

Free of errors

Data must be checked for inaccuracies, duplications, and labelling errors that could introduce systematic bias or degrade model performance.

Complete

Datasets must not have systematic gaps or missing values that would cause the model to fail on particular subpopulations or scenarios.

Appropriate statistical properties

The dataset must reflect the real-world distribution of scenarios the system will encounter, with sufficient statistical depth to enable reliable generalisation.

GDPR-compliant processing

Where training data includes personal data, processing must have a valid lawful basis under GDPR. Article 10 does not override GDPR; both frameworks apply simultaneously.
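The representativeness and completeness criteria can be spot-checked quantitatively by comparing group proportions in a dataset against a reference population. A minimal sketch, assuming illustrative group names, reference shares, and a 5% tolerance (none of which are mandated by the Act):

```python
from collections import Counter

def representativeness_gaps(records, group_key, reference_shares, tolerance=0.05):
    """Flag groups whose share in the dataset deviates from the
    reference population share by more than `tolerance`."""
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if abs(observed - expected) > tolerance:
            gaps[group] = {"expected": expected, "observed": round(observed, 3)}
    return gaps

# Hypothetical example: a training set skewed towards younger applicants
data = ([{"age_band": "18-34"}] * 70
        + [{"age_band": "35-54"}] * 25
        + [{"age_band": "55+"}] * 5)
gaps = representativeness_gaps(
    data, "age_band", {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}
)
```

A check like this does not prove representativeness on its own, but it produces the kind of documented, repeatable evidence that the assessment step below relies on.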

Step-by-Step: Building AI Data Governance

Step 1: Inventory All Data Sources Used in AI Systems

Before assessing data quality, you need to know what data is being used. Document every data source feeding into training, validation, and test datasets for each high-risk AI system. Include internal databases, third-party data purchases, publicly scraped data, synthetic data, and data collected from production systems. For each source, record its origin, volume, date range, the population it represents, and any known limitations. Also document inference data (the live inputs your AI system processes in production), as it is subject to quality considerations distinct from training data.
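An inventory entry can be kept as a structured record so every source carries the same fields. A minimal sketch; the field names and example values are illustrative, not mandated by the AI Act:

```python
from dataclasses import dataclass, field

@dataclass
class DataSource:
    """One entry in the data-source inventory for a high-risk AI system."""
    name: str
    origin: str                 # e.g. "internal CRM", "vendor X", "web scrape"
    used_for: str               # "training" | "validation" | "test" | "inference"
    volume_records: int
    date_range: str
    population: str             # who or what the data represents
    known_limitations: list = field(default_factory=list)

inventory = [
    DataSource(
        name="loan_applications_2019_2023",
        origin="internal lending database",
        used_for="training",
        volume_records=480_000,
        date_range="2019-01 to 2023-12",
        population="retail loan applicants, EU",
        known_limitations=["no applicants under 21",
                           "one region over-represented"],
    )
]
```

Recording known limitations per source feeds directly into the bias examination that Article 10 requires.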

Step 2: Assess Data Quality Against Article 10 Criteria

For each dataset, conduct a structured quality assessment covering the six criteria above. Document findings formally. This documentation forms part of your technical documentation under Article 11 and will need to be available to market surveillance authorities on request. Where data quality falls short of requirements, document what steps were taken to address the deficiency (e.g., oversampling underrepresented groups, deduplication, relabelling) and what residual risks remain.
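A structured assessment can be enforced in code by requiring a finding for every criterion and surfacing unresolved deficiencies for the residual-risk record. A sketch under assumed criterion names and record fields (the Act prescribes the criteria, not this representation):

```python
ARTICLE_10_CRITERIA = [
    "relevant", "representative", "free_of_errors",
    "complete", "statistical_properties", "gdpr_lawful_basis",
]

def assess_dataset(findings):
    """Validate that every criterion was assessed; return the failures.
    `findings` maps each criterion to a dict with "status" ("pass"/"fail"),
    "notes", and "remediation" fields."""
    missing = [c for c in ARTICLE_10_CRITERIA if c not in findings]
    if missing:
        raise ValueError(f"assessment incomplete, missing: {missing}")
    return {c: f for c, f in findings.items() if f["status"] == "fail"}

# Hypothetical assessment with one documented deficiency
findings = {c: {"status": "pass", "notes": "", "remediation": ""}
            for c in ARTICLE_10_CRITERIA}
findings["representative"] = {
    "status": "fail",
    "notes": "55+ age band under-represented",
    "remediation": "oversample 55+ records; residual risk documented",
}
deficiencies = assess_dataset(findings)
```

Failing fast on an incomplete assessment mirrors the documentation duty: a criterion that was never examined is itself a gap to record.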

Step 3: Implement PII Detection and Control

Personal data in training sets and live inference inputs presents both a data quality risk and a GDPR compliance risk. Training on poorly handled PII can cause models to inadvertently memorise and reproduce personal information in outputs, a risk documented across many LLM deployments. Implement automated PII scanning across training datasets before use, and monitor live inference API traffic for PII flowing into or out of AI models in production. Real-time inspection of AI API requests can detect unexpected personal data being submitted to AI endpoints, which may indicate misuse or unintended data flows that require investigation.
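At its simplest, automated PII scanning is pattern matching over text fields before records enter a training set. A deliberately minimal sketch using a few regular expressions; production scanning needs far broader coverage (names, addresses, national identifiers) and typically a dedicated tool:

```python
import re

# Illustrative patterns only, not a complete PII taxonomy
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "phone": re.compile(r"\+\d{1,3}[\s-]?\d{6,12}\b"),
}

def scan_for_pii(text):
    """Return the PII categories detected in a text field, so the
    record can be quarantined or pseudonymised before training."""
    return [name for name, pattern in PII_PATTERNS.items()
            if pattern.search(text)]

hits = scan_for_pii("Contact jane.doe@example.com or +44 7700900123")
```

The same function can sit in front of inference traffic: scanning request payloads for unexpected PII categories is one way to detect the unintended data flows described above.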

Step 4: Build Data Lineage Documentation

Article 10 requires that providers document data governance and data preparation activities. Data lineage (a traceable record of where data came from, how it was processed, and how it was used) is the practical implementation of this requirement. For each dataset, document: source systems, collection methods, transformation steps applied (normalisation, anonymisation, augmentation), and the resulting dataset version used for training. Lineage documentation enables you to respond to regulator requests and to investigate root causes when model performance issues arise.
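A lineage entry can tie the source, the ordered transformation steps, and a content-derived version identifier together, so the exact dataset version used for training is traceable. A sketch with hypothetical source and step names:

```python
import hashlib
import json

def lineage_record(source, steps, output_name):
    """Build a lineage entry: where the data came from, what was done
    to it, and a deterministic version id derived from the record."""
    record = {"source": source, "transformations": steps, "output": output_name}
    digest = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    record["version_id"] = digest[:12]
    return record

entry = lineage_record(
    source="loan_applications_2019_2023",
    steps=["dropped duplicate application ids",
           "normalised income to EUR",
           "pseudonymised applicant names"],
    output_name="train_v4",
)
```

Deriving the version id from the record itself means any undocumented change to the steps produces a different id, which helps keep documentation and reality in sync.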

Step 5: Apply Access Controls to Training Data

Implement least-privilege access controls for datasets used in AI development. Not all engineers working on an AI system need access to raw training data containing personal information. Segment access: data scientists may work with anonymised or aggregated versions of training data, while access to raw personal data is restricted to those with a documented need. Maintain an audit log of access to training datasets. These controls reduce the risk of inadvertent data leakage and demonstrate that appropriate safeguards were in place.
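The combination of tiered, least-privilege access and an append-only audit log can be sketched as a simple policy check. The roles, tiers, and field names are assumptions for illustration:

```python
from datetime import datetime, timezone

# Illustrative role policy: which data tiers each role may access
ACCESS_POLICY = {
    "data_scientist": {"anonymised", "aggregated"},
    "data_steward": {"anonymised", "aggregated", "raw_personal"},
}

audit_log = []

def request_access(user, role, tier):
    """Grant access only if the role's policy covers the data tier,
    and record every attempt (granted or not) in the audit log."""
    granted = tier in ACCESS_POLICY.get(role, set())
    audit_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user, "role": role, "tier": tier, "granted": granted,
    })
    return granted

ok_anon = request_access("alice", "data_scientist", "anonymised")
ok_raw = request_access("alice", "data_scientist", "raw_personal")
```

Logging denied attempts as well as grants matters: a pattern of refused requests for raw personal data is itself a signal worth reviewing.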

Step 6: Manage Third-Party Data Quality Obligations

Many AI systems are trained on, or make inferences with, data sourced from third parties, including vendors, data brokers, or external APIs. If you are the high-risk AI provider, the Article 10 obligations apply to you regardless of data origin. Your contracts with data suppliers should include warranties about data quality, compliance with applicable law, and rights to use the data for the stated purpose. Conduct due diligence on third-party data providers proportionate to the risk. For high-risk AI systems, this should be documented and repeatable.
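A documented, repeatable due-diligence review can be as simple as a fixed checklist applied to every supplier. The items below are assumptions sketching the contractual points mentioned above, not a legal standard:

```python
# Illustrative checklist for a third-party data supplier review
DUE_DILIGENCE_ITEMS = [
    "quality_warranty_in_contract",
    "lawful_collection_confirmed",
    "usage_rights_cover_ai_training",
    "bias_documentation_provided",
]

def due_diligence_gaps(supplier_answers):
    """Return the unmet checklist items, so the review leaves a
    consistent record for every supplier."""
    return [item for item in DUE_DILIGENCE_ITEMS
            if not supplier_answers.get(item)]

gaps = due_diligence_gaps({
    "quality_warranty_in_contract": True,
    "lawful_collection_confirmed": True,
    "usage_rights_cover_ai_training": True,
    "bias_documentation_provided": False,
})
```

Because the checklist is fixed, reviews of different suppliers produce comparable records, which supports the "documented and repeatable" expectation for high-risk systems.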

The GDPR Intersection

Article 10 does not replace or override GDPR; both frameworks apply simultaneously to AI systems that process personal data. Key considerations include:

Lawful basis: Any processing of personal data for training, validation, or testing requires a valid lawful basis under GDPR Article 6, and the reasoning (for example, a legitimate interests assessment) should be documented.

Purpose limitation and data minimisation: Personal data collected for one purpose cannot automatically be reused for model training; training datasets should be limited to what is necessary for the stated purpose.

Special category data: Article 10(5) of the AI Act permits providers to process special categories of personal data, subject to safeguards, strictly where necessary for bias detection and correction; GDPR Article 9 conditions continue to apply.

Data subject rights: Access, rectification, and erasure requests must remain actionable even after personal data has entered a training pipeline, which is considerably easier when lineage and access controls are in place.

Common Data Governance Failures

PII leaking into model outputs: When training data containing personal information is not adequately handled, models can memorise and reproduce it, returning names, addresses, or other personal details in response to prompts. This constitutes a GDPR breach and undermines the AI system's reliability.

Unrepresentative training data causing discriminatory outputs: If training data systematically under-represents certain demographic groups, the model's performance will be lower for those groups. In high-risk contexts such as employment, credit, or healthcare, this can constitute unlawful discrimination and violates Article 10's representativeness requirement.

No documentation of data preparation choices: Informal data preparation decisions (e.g., choosing to exclude certain edge cases from training) can have significant effects on model behaviour and create compliance risk if undocumented. All material data preparation choices should be recorded with their rationale.
