The ethics of training on copyrighted data — a nuanced take

Simor Consulting | 18 May, 2026 | 05 Mins read

The legal system has not caught up with the practice of training AI models on copyrighted data, and the people building AI systems are not waiting for it. Models trained on books, articles, code repositories, images, and music continue to be developed, deployed, and commercialized while courts sort out whether the training constitutes fair use, infringement, or something the existing legal framework was never designed to address.

This creates a practical problem for organizations building or deploying AI systems. The legal risk is real but undefined. The ethical concerns are legitimate but unresolved. And the competitive pressure to use all available data — including copyrighted data — is intense, because organizations that limit their training data voluntarily are competing against organizations that do not.

I do not have a clean answer. No one does. But I have a framework for thinking about it that has helped the organizations I work with make defensible decisions while the legal landscape settles.

The fair use argument and its limits

The strongest legal argument for training on copyrighted data is the fair use doctrine, specifically the transformative use argument. Training a model on copyrighted text does not reproduce the text. It extracts statistical patterns that the model uses to generate new outputs. The copyrighted work is an input to a computation, not a reproduction. By analogy, a person who reads a thousand novels and then writes an original novel has used the copyrighted works as inputs to a creative process, and no one considers that infringement.

The analogy holds in many cases. A language model trained on millions of documents produces outputs that do not reproduce any specific input document. The model has learned patterns — grammar, style, reasoning structures — not memorized passages. This is genuinely transformative, and the legal scholars who argue for fair use in this context have a strong position.

The analogy breaks in specific cases. Models can be prompted to reproduce substantial portions of copyrighted works, particularly well-known texts that appear frequently in training data. Code models trained on open-source repositories sometimes reproduce copyrighted code verbatim while omitting the license header that accompanied it. Image models can be prompted to generate images “in the style of” a specific artist, raising questions about whether the model’s output is derivative of the artist’s copyrighted work.

The limits of the fair use argument matter practically because they define the boundary between training that is likely to survive legal challenge and training that is not. Organizations that train on copyrighted data and implement strong safeguards against verbatim reproduction have a defensible position. Organizations that train on copyrighted data and do not implement such safeguards are taking a risk that the legal system may eventually quantify.

The economic justice argument

The fair use argument is a legal argument. The economic justice argument is a different conversation entirely.

When an AI company trains a model on a corpus of books, the company captures economic value from those books. The authors of those books created the value. They were not compensated for the use of their work in training. The AI company’s product competes, in some markets, with the authors whose work made the product possible.

This is not a hypothetical concern. Journalists have documented cases where AI-generated summaries of news articles reduce traffic to the original articles, directly reducing the advertising revenue that funds journalism. Authors have documented cases where AI-generated text mimics their distinctive style, creating substitutes for their work. Photographers and illustrators face a similar dynamic with image generation models.

The economic justice argument does not require proving that AI training is legally infringing. It requires acknowledging that value is being extracted from creators without compensation, and that this extraction has economic consequences for the people and industries that produce the training data.

Some organizations respond to this concern by licensing training data. Getty Images has pursued a licensed-data strategy for its image AI. Some publishers have negotiated licensing deals with AI companies. These arrangements are imperfect — the compensation is often small relative to the value extracted — but they represent an attempt to address the economic justice concern within existing market structures.

A framework for organizational decisions

For organizations building or deploying AI systems, the ethical question is not abstract. It is practical: what training data should we use, what safeguards should we implement, and what compensation mechanisms should we support?

I recommend a three-part framework.

Classify your data sources by risk. Publicly available data with permissive licenses is low risk. Copyrighted data with licensing agreements is medium risk. Copyrighted data scraped without permission is high risk. This classification is legal, not ethical — it maps to the likelihood of legal challenge, not the moral weight of the decision. But it is a necessary starting point because organizations need to understand their legal exposure.
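The three tiers above can be recorded as a simple inventory rule. The following is a minimal sketch, not a legal test: the `DataSource` fields, the `classify` function, and the choice to treat any source of unknown status as high risk are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class DataSource:
    """Hypothetical inventory record for one training data source."""
    name: str
    permissive_license: bool   # publicly available under a permissive license
    licensing_agreement: bool  # a negotiated license covers training use


def classify(source: DataSource) -> Risk:
    """Map a source to the legal-risk tier described in the text."""
    if source.permissive_license:
        return Risk.LOW
    if source.licensing_agreement:
        return Risk.MEDIUM
    # Assumption: anything without a clear license or agreement is
    # treated like scraped copyrighted data, the most conservative tier.
    return Risk.HIGH


if __name__ == "__main__":
    inventory = [
        DataSource("permissive-code-corpus", True, False),
        DataSource("licensed-news-archive", False, True),
        DataSource("scraped-book-corpus", False, False),
    ]
    for src in inventory:
        print(f"{src.name}: {classify(src).value}")
```

Keeping the rule this blunt is deliberate. The point of the inventory is to make exposure visible, and a source that cannot be placed in the first two tiers should surface as high risk until someone establishes otherwise.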

Implement reproduction safeguards regardless of data source. Even if you believe your use of copyrighted training data is fair use, implement technical safeguards against verbatim reproduction. Memorization detection, output filtering, and attribution mechanisms reduce the risk that your model produces outputs that are substantially similar to copyrighted inputs. These safeguards also strengthen your fair use argument by demonstrating that you took steps to prevent the kind of reproduction that would weaken it.
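One simple form of output filtering is to check generated text for long word sequences shared with a set of protected reference texts. The sketch below assumes whitespace tokenization, an n-gram length of 8, and a blocking threshold of 0.2. These values, and the function names, are illustrative choices rather than established standards, and a production filter would need a far larger index and more robust text normalization.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(protected_texts: Iterable[str], n: int = 8) -> Set[NGram]:
    """Collect every n-gram that appears in the protected reference texts."""
    index: Set[NGram] = set()
    for text in protected_texts:
        index |= ngrams(text, n)
    return index


def overlap_ratio(output: str, index: Set[NGram], n: int = 8) -> float:
    """Fraction of the output's n-grams that also occur in the index."""
    out = ngrams(output, n)
    if not out:
        return 0.0  # output is shorter than n words, so nothing to compare
    return len(out & index) / len(out)


def should_block(output: str, index: Set[NGram], n: int = 8,
                 threshold: float = 0.2) -> bool:
    """Flag outputs whose overlap with protected text meets the threshold."""
    return overlap_ratio(output, index, n) >= threshold
```

In use, the index would be built once from the reference texts and `should_block` would be called on each model output before it is returned. Exact n-gram matching only catches verbatim or near-verbatim copying, so paraphrased reproduction would pass through unflagged.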

Support compensation mechanisms, even if you are not legally required to. This is the ethical recommendation, not the legal one. If your model generates value from copyrighted training data, contribute to the ecosystem that produces that data. This might mean licensing agreements, revenue-sharing arrangements, or contributions to creator funds. The specific mechanism matters less than the principle: if you extract value from a resource, contribute to the sustainability of that resource.

The uncomfortable position

Most organizations building AI systems are in an uncomfortable position. They know that their training data includes copyrighted material. They know that the legal framework has not resolved whether this use is permitted. They know that competitive dynamics favor using all available data. And they know that there are legitimate ethical concerns that the legal framework may not address even once it resolves the question of permitted use.

The organizations I respect most are the ones that acknowledge this discomfort rather than resolving it prematurely in either direction. They do not pretend that copyright concerns are irrelevant because fair use might apply. They do not refuse to train on copyrighted data because the ethical questions are unresolved. They make deliberate decisions about data sources, implement safeguards, support compensation mechanisms, and maintain the flexibility to adjust their approach as the legal and ethical landscape develops.

The heuristic: if you would not be comfortable explaining your training data practices to the creators whose work is in your dataset, you have a problem. Not a legal problem necessarily, but an ethical one. And ethical problems that go unaddressed have a tendency to become legal problems eventually.
