The regulatory focus on AI is shifting from the models themselves to the data that trains them. The EU AI Act requires documentation of training data provenance and composition. The US Copyright Office has published guidance and reports on generative AI, including one on the use of copyrighted works in training. China’s rules on generative AI services require that training data come from lawful sources. And ongoing litigation, from New York Times v. OpenAI to Getty Images v. Stability AI, is testing whether the use of copyrighted material for model training creates legal liability.
For data teams, this means the question “where did this training data come from?” is transitioning from a best practice to a legal obligation with financial consequences.
The Regulatory Landscape
Three regulatory threads are converging on training data:
Data provenance requirements. The EU AI Act requires that training, validation, and testing datasets for high-risk AI systems be “relevant, sufficiently representative, and to the best extent possible, free of errors and complete in view of the intended purpose.” This reads like a quality standard, but the enforcement mechanism is documentation: you must be able to demonstrate, with evidence, that your training data meets these criteria. If you cannot produce that documentation, you cannot show compliance, regardless of the actual data quality.
Copyright and licensing. The law on training data copyright is unsettled, but the trend is toward requiring either licenses for copyrighted training data or clear fair-use justifications. Organizations that trained models on scraped web data without tracking provenance are discovering that retroactive compliance is extremely difficult: you cannot produce a license for data whose source you did not record.
Privacy and data protection. GDPR requires a data protection impact assessment (DPIA) for processing that is likely to result in a high risk to individuals, a category that includes systematic and extensive evaluation of people through automated processing. If your training data contains personal data (and at web scale, it almost certainly does), you likely need DPIAs for your training pipelines, not just your inference pipelines. The gap between claiming “we only use anonymized data” and actual compliance is wider than most teams realize.
The Compliance Problem
The core compliance problem is traceability. Most organizations cannot answer these questions about their training data:
- What sources does the training data come from?
- What is the license status of each source?
- Does the data contain personal data, and if so, under what legal basis is it processed?
- What preprocessing was applied, and does the preprocessing affect the licensing or privacy status?
- When was the data collected, and has its availability or licensing changed since?
For models trained on internally generated data, these questions are answerable with moderate effort. For models trained on web-scraped data, they are often unanswerable without re-indexing the entire dataset, which may be impractical at web scale.
What Data Teams Should Do
Audit your training data inventory. For every model in production, document the training data sources, their licensing status, and the legal basis for their use. If the documentation does not exist, flag the model as a compliance risk.
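The audit described above can be sketched in a few lines of Python. This is a minimal illustration, not a compliance tool: the `DataSource` and `ModelRecord` schemas, the field names, and the example models are all hypothetical, and a real inventory would likely live in a catalog or database rather than in memory.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataSource:
    """One training data source as recorded in the inventory (hypothetical schema)."""
    name: str
    license: Optional[str] = None      # e.g. "CC-BY-4.0"; None means undocumented
    legal_basis: Optional[str] = None  # e.g. "contract"; None means undocumented


@dataclass
class ModelRecord:
    """A production model and the sources it was trained on."""
    model_name: str
    sources: list[DataSource] = field(default_factory=list)


def audit_inventory(models: list[ModelRecord]) -> dict[str, list[str]]:
    """Return {model_name: [problems]} for every model with documentation gaps."""
    findings: dict[str, list[str]] = {}
    for model in models:
        problems = []
        if not model.sources:
            problems.append("no training data sources documented")
        for src in model.sources:
            if src.license is None:
                problems.append(f"{src.name}: license status missing")
            if src.legal_basis is None:
                problems.append(f"{src.name}: legal basis missing")
        if problems:
            findings[model.model_name] = problems
    return findings


# Illustrative inventory; the names are made up for this example.
inventory = [
    ModelRecord("churn-model", [DataSource("crm_export", "internal", "contract")]),
    ModelRecord("support-bot", [DataSource("forum_scrape")]),
    ModelRecord("legacy-ranker"),
]

for name, problems in audit_inventory(inventory).items():
    print(f"COMPLIANCE RISK {name}: {'; '.join(problems)}")
```

A model with no documented sources is flagged just like one with incomplete sources, which matches the rule that missing documentation is itself the risk.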
Implement data provenance tracking in your training pipeline. Every dataset that enters the training pipeline should carry metadata about its source, collection date, license, and any transformations applied. This metadata should be stored alongside the model artifacts so that provenance is preserved through model versioning.
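One way to carry this metadata is a small manifest written into the same directory as the model artifacts. The sketch below assumes a hypothetical `DatasetProvenance` schema and a `provenance.json` file name; neither is a standard, and the content hash is an addition that ties each record to the exact bytes that were used.

```python
import hashlib
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class DatasetProvenance:
    """Provenance metadata for one dataset (hypothetical schema)."""
    source: str
    collected_on: str  # ISO 8601 date string
    license: str
    sha256: str        # fingerprint of the dataset file as used for training
    transformations: list[str] = field(default_factory=list)


def fingerprint(path: Path) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(model_dir: Path, model_version: str,
                   datasets: list[DatasetProvenance]) -> Path:
    """Store the provenance manifest next to the model artifacts."""
    manifest = {
        "model_version": model_version,
        "datasets": [asdict(d) for d in datasets],
    }
    out = model_dir / "provenance.json"
    out.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return out


# Demonstration in a temporary directory with a tiny stand-in dataset.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    data_file = workdir / "reviews.csv"
    data_file.write_text("id,text\n1,great product\n")

    record = DatasetProvenance(
        source="https://example.com/reviews",  # placeholder URL
        collected_on="2024-01-15",
        license="CC-BY-4.0",
        sha256=fingerprint(data_file),
        transformations=["deduplicated", "lowercased"],
    )
    manifest_path = write_manifest(workdir, "v1.0.0", [record])
    loaded = json.loads(manifest_path.read_text())
    print(json.dumps(loaded, indent=2))
```

Because the manifest is versioned with the model directory, each model version keeps the provenance of the data it was actually trained on.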
Establish a training data review process. Before a new data source is added to a training dataset, it should be reviewed for licensing, privacy, and quality. This review does not need to be onerous. A checklist that covers license compatibility, personal data presence, and source reliability is sufficient for most cases.
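Such a checklist can be encoded as a short function that returns the blocking issues for a candidate source. The sketch below is illustrative only: the `APPROVED_LICENSES` allow-list, the `SourceCandidate` fields, and the pass/fail rules are assumptions that a legal or governance team would need to define for your organization.

```python
from dataclasses import dataclass

# Hypothetical allow-list; the real list is a legal decision, not an engineering one.
APPROVED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "MIT", "internal"}


@dataclass
class SourceCandidate:
    """A data source proposed for inclusion in a training dataset."""
    name: str
    license: str
    contains_personal_data: bool
    legal_basis: str = ""            # required only when personal data is present
    source_is_reliable: bool = True  # outcome of a manual quality check


def review_source(candidate: SourceCandidate) -> list[str]:
    """Return the blocking issues; an empty list means the source passes."""
    issues = []
    if candidate.license not in APPROVED_LICENSES:
        issues.append(f"license '{candidate.license}' is not on the approved list")
    if candidate.contains_personal_data and not candidate.legal_basis:
        issues.append("personal data present but no legal basis recorded")
    if not candidate.source_is_reliable:
        issues.append("source reliability not established")
    return issues


ok = SourceCandidate("public_domain_books", "CC0-1.0", contains_personal_data=False)
blocked = SourceCandidate("forum_scrape", "unknown", contains_personal_data=True)

print(review_source(ok))       # no issues, so the source can be added
print(review_source(blocked))  # lists the license and legal-basis problems
```

Returning a list of reasons rather than a boolean keeps the review lightweight while still leaving a record of why a source was rejected.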
Prepare for data deletion requests. GDPR’s right to erasure may extend to trained models. If an individual requests that their personal data be removed from a training dataset, and a model was trained on that data, the organization may need to retrain the model without it. Doing so requires identifying which training records contain the individual’s data, which is not feasible without provenance tracking.
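A minimal sketch of the lookup this implies: an index from data subjects to training records, and from models to the datasets they were trained on. The `ErasureIndex` class and all identifiers below are hypothetical. Note that such an index is itself personal data processing, so in practice the subject identifiers would likely need to be pseudonymized and access-controlled.

```python
from collections import defaultdict


class ErasureIndex:
    """Maps data subjects to training records and models to datasets (illustrative)."""

    def __init__(self) -> None:
        self._records_by_subject: dict[str, set[tuple[str, str]]] = defaultdict(set)
        self._datasets_by_model: dict[str, set[str]] = defaultdict(set)

    def register_record(self, subject_id: str, dataset: str, record_id: str) -> None:
        """Note that a record in a dataset contains this subject's data."""
        self._records_by_subject[subject_id].add((dataset, record_id))

    def register_training_run(self, model: str, datasets: list[str]) -> None:
        """Note which datasets a model version was trained on."""
        self._datasets_by_model[model].update(datasets)

    def erasure_plan(self, subject_id: str) -> dict[str, list]:
        """List the records to delete and the models trained on affected datasets."""
        records = sorted(self._records_by_subject.get(subject_id, set()))
        touched = {dataset for dataset, _ in records}
        models = sorted(
            model for model, datasets in self._datasets_by_model.items()
            if datasets & touched
        )
        return {"records_to_delete": records, "models_to_review": models}


index = ErasureIndex()
index.register_record("subject-42", "support_tickets", "row-1001")
index.register_record("subject-42", "forum_posts", "row-77")
index.register_record("subject-7", "forum_posts", "row-78")
index.register_training_run("support-bot-v2", ["support_tickets"])
index.register_training_run("ranker-v1", ["product_catalog"])

plan = index.erasure_plan("subject-42")
print(plan["records_to_delete"])
print(plan["models_to_review"])
```

The plan only identifies candidates for review. Whether a given model actually has to be retrained is a legal judgment that this lookup cannot make.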
The Industry Response
Some organizations are responding with synthetic data generation: training models on artificially generated data that is intended to avoid copyright and privacy obligations. This approach works for some use cases but introduces new risks. Synthetic data can amplify biases present in the generation process and may not capture the distribution characteristics of real-world data.
Others are turning to licensed data marketplaces — platforms that provide training data with explicit licenses for AI use. These marketplaces are growing but remain small relative to the volume of data required for frontier model training.
The most common response, however, is inaction. Many teams are operating under the assumption that training data regulation will not be enforced, or that their use of training data falls under fair use. That assumption is becoming increasingly risky.
Bounded Recommendation
If you train or fine-tune models, make training data provenance a first-class concern in your data pipeline. The cost of building provenance tracking is modest. The cost of not having it — regulatory penalties, litigation exposure, forced retraining — is significant and growing. Start with the models that serve the most regulated use cases and work backward.