The ethics of training on copyrighted data — a nuanced take

Simor Consulting | 18 May, 2026 | 05 Mins read

The legal system has not caught up with the practice of training AI models on copyrighted data, and the people building AI systems are not waiting for it. Models trained on books, articles, code repositories, images, and music continue to be developed, deployed, and commercialized while courts sort out whether the training constitutes fair use, infringement, or something the existing legal framework was never designed to address.

This creates a practical problem for organizations building or deploying AI systems. The legal risk is real but undefined. The ethical concerns are legitimate but unresolved. And the competitive pressure to use all available data — including copyrighted data — is intense, because organizations that limit their training data voluntarily are competing against organizations that do not.

I do not have a clean answer. No one does. But I have a framework for thinking about it that has helped the organizations I work with make defensible decisions while the legal landscape settles.

The fair use argument and its limits

The strongest legal argument for training on copyrighted data is the fair use doctrine, specifically the transformative use argument. Training a model on copyrighted text does not reproduce the text. It extracts statistical patterns that the model uses to generate new outputs. The copyrighted work is an input to a computation, not a reproduction. By analogy, a person who reads a thousand novels and then writes an original novel has used the copyrighted works as inputs to a creative process, and no one considers that infringement.

The analogy holds in many cases. A language model trained on millions of documents typically produces outputs that do not reproduce any specific input document. In those cases the model has learned patterns (grammar, style, reasoning structures) rather than memorized passages. This is genuinely transformative, and the legal scholars who argue for fair use in this context have a strong position.

The analogy breaks in specific cases. Models can be prompted to reproduce substantial portions of copyrighted works, particularly well-known texts that appear frequently in training data. Code models trained on open-source repositories sometimes reproduce licensed code verbatim while dropping the license headers that accompanied it. Image models can be prompted to generate images “in the style of” a specific artist, raising questions about whether the model’s output is derivative of the artist’s copyrighted work.

The limits of the fair use argument matter practically because they define the boundary between training that is likely to survive legal challenge and training that is not. Organizations that train on copyrighted data and implement strong safeguards against verbatim reproduction have a defensible position. Organizations that train on copyrighted data and do not implement such safeguards are taking a risk that the legal system may eventually quantify.

The economic justice argument

The fair use argument is a legal argument. The economic justice argument is a different conversation entirely.

When an AI company trains a model on a corpus of books, the company captures economic value from those books. The authors of those books created the value. They were not compensated for the use of their work in training. The AI company’s product competes, in some markets, with the authors whose work made the product possible.

This is not a hypothetical concern. Journalists have documented cases where AI-generated summaries of news articles reduce traffic to the original articles, directly reducing the advertising revenue that funds journalism. Authors have documented cases where AI-generated text mimics their distinctive style, creating substitutes for their work. Photographers and illustrators face a similar dynamic with image generation models.

The economic justice argument does not require proving that AI training is legally infringing. It requires acknowledging that value is being extracted from creators without compensation, and that this extraction has economic consequences for the people and industries that produce the training data.

Some organizations respond to this concern by licensing training data. Getty Images has pursued a licensed-data strategy for its image AI. Some publishers have negotiated licensing deals with AI companies. These arrangements are imperfect — the compensation is often small relative to the value extracted — but they represent an attempt to address the economic justice concern within existing market structures.

A framework for organizational decisions

For organizations building or deploying AI systems, the ethical question is not abstract. It is practical: what training data should we use, what safeguards should we implement, and what compensation mechanisms should we support?

I recommend a three-part framework.

Classify your data sources by risk. Publicly available data with permissive licenses is low risk. Copyrighted data with licensing agreements is medium risk. Copyrighted data scraped without permission is high risk. This classification is legal, not ethical — it maps to the likelihood of legal challenge, not the moral weight of the decision. But it is a necessary starting point because organizations need to understand their legal exposure.
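A minimal sketch of how this classification could be recorded in a data inventory. The `DataSource` fields, the `classify` function, and the example source names are hypothetical and chosen for illustration only; a real inventory would need legal review of each source rather than two boolean flags.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class DataSource:
    # Hypothetical fields; a real inventory would track far more detail.
    name: str
    permissive_license: bool   # publicly available under a permissive license
    licensing_agreement: bool  # a negotiated agreement covers training use


def classify(source: DataSource) -> Risk:
    """Map a source onto the three legal-risk tiers described above."""
    if source.permissive_license:
        return Risk.LOW
    if source.licensing_agreement:
        return Risk.MEDIUM
    # Copyrighted material with no permission on record.
    return Risk.HIGH


# Illustrative sources only, not real datasets.
sources = [
    DataSource("public-domain-books", True, False),
    DataSource("licensed-news-archive", False, True),
    DataSource("scraped-forum-posts", False, False),
]
inventory = {s.name: classify(s).value for s in sources}
print(inventory)
```

Keeping the tier as an explicit field on each source makes it straightforward to report legal exposure per dataset, which is the stated purpose of this step.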

Implement reproduction safeguards regardless of data source. Even if you believe your use of copyrighted training data is fair use, implement technical safeguards against verbatim reproduction. Memorization detection, output filtering, and attribution mechanisms reduce the risk that your model produces outputs that are substantially similar to copyrighted inputs. These safeguards also strengthen your fair use argument by demonstrating that you took steps to prevent the kind of reproduction that would weaken it.
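One simple form of output filtering is an n-gram overlap check against a set of protected texts. The sketch below is an assumption about how such a filter might look, not a description of any particular production system: the function names, the whitespace tokenization, the window size `n`, and the `threshold` value are all illustrative and would need tuning and a far more scalable index in practice.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(protected_texts: list[str], n: int = 8) -> set[tuple[str, ...]]:
    """Collect every n-gram that appears in any protected text."""
    index: set[tuple[str, ...]] = set()
    for text in protected_texts:
        index |= ngrams(text, n)
    return index


def overlap_ratio(output: str, index: set[tuple[str, ...]], n: int = 8) -> float:
    """Fraction of the output's n-grams that also occur in the protected index."""
    out = ngrams(output, n)
    if not out:
        return 0.0  # output shorter than n tokens: nothing to compare
    return len(out & index) / len(out)


def should_block(output: str, index: set[tuple[str, ...]],
                 n: int = 8, threshold: float = 0.2) -> bool:
    """Flag outputs whose verbatim overlap meets the (assumed) threshold."""
    return overlap_ratio(output, index, n) >= threshold


# Toy example with a public-domain sentence standing in for a protected work.
protected = ("it was the best of times it was the worst of times "
             "it was the age of wisdom")
index = build_index([protected], n=5)

verbatim = "it was the best of times it was the worst of times"
original = "the model describes the mood of the era in its own words today"

print(should_block(verbatim, index, n=5))
print(should_block(original, index, n=5))
```

An exact-match check like this only catches verbatim copying. Paraphrase and close stylistic imitation need different techniques, which is one reason to combine it with the memorization detection and attribution mechanisms mentioned above.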

Support compensation mechanisms, even if you are not legally required to. This is the ethical recommendation, not the legal one. If your model generates value from copyrighted training data, contribute to the ecosystem that produces that data. This might mean licensing agreements, revenue-sharing arrangements, or contributions to creator funds. The specific mechanism matters less than the principle: if you extract value from a resource, contribute to the sustainability of that resource.

The uncomfortable position

Most organizations building AI systems are in an uncomfortable position. They know that their training data includes copyrighted material. They know that the legal framework has not resolved whether this use is permitted. They know that competitive dynamics favor using all available data. And they know that there are legitimate ethical concerns that the law may leave unaddressed even after the legal questions are settled.

The organizations I respect most are the ones that acknowledge this discomfort rather than resolving it prematurely in either direction. They do not pretend that copyright concerns are irrelevant because fair use might apply. They do not refuse to train on copyrighted data because the ethical questions are unresolved. They make deliberate decisions about data sources, implement safeguards, support compensation mechanisms, and maintain the flexibility to adjust their approach as the legal and ethical landscape develops.

The heuristic: if you would not be comfortable explaining your training data practices to the creators whose work is in your dataset, you have a problem. Not a legal problem necessarily, but an ethical one. And ethical problems that go unaddressed have a tendency to become legal problems eventually.

