The legal system has not caught up with the practice of training AI models on copyrighted data, and the people building AI systems are not waiting for it. Models trained on books, articles, code repositories, images, and music continue to be developed, deployed, and commercialized while courts sort out whether the training constitutes fair use, infringement, or something the existing legal framework was never designed to address.
This creates a practical problem for organizations building or deploying AI systems. The legal risk is real but undefined. The ethical concerns are legitimate but unresolved. And the competitive pressure to use all available data — including copyrighted data — is intense, because organizations that limit their training data voluntarily are competing against organizations that do not.
I do not have a clean answer. No one does. But I have a framework for thinking about it that has helped the organizations I work with make defensible decisions while the legal landscape settles.
The fair use argument and its limits
The strongest legal argument for training on copyrighted data is the fair use doctrine, specifically the transformative use argument. Training a model on copyrighted text does not reproduce the text. It extracts statistical patterns that the model uses to generate new outputs. The copyrighted work is an input to a computation, not a reproduction. By analogy, a person who reads a thousand novels and then writes an original novel has used the copyrighted works as inputs to a creative process, and no one considers that infringement.
The analogy holds in many cases. A language model trained on millions of documents produces outputs that do not reproduce any specific input document. The model has learned patterns — grammar, style, reasoning structures — not memorized passages. This is genuinely transformative, and the legal scholars who argue for fair use in this context have a strong position.
The analogy breaks in specific cases. Models can be prompted to reproduce substantial portions of copyrighted works, particularly well-known texts that appear frequently in training data. Code models trained on open-source repositories sometimes reproduce licensed code verbatim while omitting the license headers that accompanied it. Image models can be prompted to generate images “in the style of” a specific artist, raising questions about whether the model’s output is derivative of the artist’s copyrighted work.
The limits of the fair use argument matter practically because they define the boundary between training that is likely to survive legal challenge and training that is not. Organizations that train on copyrighted data and implement strong safeguards against verbatim reproduction have a defensible position. Organizations that train on copyrighted data and do not implement such safeguards are taking a risk that the legal system may eventually quantify.
The economic justice argument
The fair use argument is a legal argument. The economic justice argument is a different conversation entirely.
When an AI company trains a model on a corpus of books, the company captures economic value from those books. The authors of those books created the value. They were not compensated for the use of their work in training. The AI company’s product competes, in some markets, with the authors whose work made the product possible.
This is not a hypothetical concern. Journalists have documented cases where AI-generated summaries of news articles reduce traffic to the original articles, directly reducing the advertising revenue that funds journalism. Authors have documented cases where AI-generated text mimics their distinctive style, creating substitutes for their work. Photographers and illustrators face a similar dynamic with image generation models.
The economic justice argument does not require proving that AI training is legally infringing. It requires acknowledging that value is being extracted from creators without compensation, and that this extraction has economic consequences for the people and industries that produce the training data.
Some organizations respond to this concern by licensing training data. Getty Images has pursued a licensed-data strategy for its image AI. Some publishers have negotiated licensing deals with AI companies. These arrangements are imperfect — the compensation is often small relative to the value extracted — but they represent an attempt to address the economic justice concern within existing market structures.
A framework for organizational decisions
For organizations building or deploying AI systems, the ethical question is not abstract. It is practical: what training data should we use, what safeguards should we implement, and what compensation mechanisms should we support?
I recommend a three-part framework.
Classify your data sources by risk. Publicly available data with permissive licenses is low risk. Copyrighted data with licensing agreements is medium risk. Copyrighted data scraped without permission is high risk. This classification is legal, not ethical — it maps to the likelihood of legal challenge, not the moral weight of the decision. But it is a necessary starting point because organizations need to understand their legal exposure.
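As a rough illustration, the three tiers can be written down as a small lookup over an inventory of data sources. This is a minimal sketch, not a legal tool: the DataSource fields, the Risk names, and the classify function are hypothetical names chosen for the example, and a real inventory would need legal review of each source.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class DataSource:
    # All fields are illustrative assumptions, not a standard schema.
    name: str
    permissive_license: bool   # publicly available under a permissive license
    licensing_agreement: bool  # a negotiated license covers training use


def classify(source: DataSource) -> Risk:
    """Map a data source to the three legal-risk tiers described above."""
    if source.permissive_license:
        return Risk.LOW
    if source.licensing_agreement:
        return Risk.MEDIUM
    # Copyrighted material collected without permission.
    return Risk.HIGH


def summarize(sources: list[DataSource]) -> dict[Risk, list[str]]:
    """Group source names by risk tier so exposure can be reviewed at a glance."""
    report: dict[Risk, list[str]] = {risk: [] for risk in Risk}
    for source in sources:
        report[classify(source)].append(source.name)
    return report
```

Running summarize over the full inventory gives a per-tier list that can be handed to counsel. The point of the sketch is that the classification is mechanical once the licensing facts are recorded. Recording those facts accurately is the hard part.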
Implement reproduction safeguards regardless of data source. Even if you believe your use of copyrighted training data is fair use, implement technical safeguards against verbatim reproduction. Memorization detection, output filtering, and attribution mechanisms reduce the risk that your model produces outputs that are substantially similar to copyrighted inputs. These safeguards also strengthen your fair use argument by demonstrating that you took steps to prevent the kind of reproduction that would weaken it.
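One simple form of output filtering is to check generated text for long word sequences that also appear in a set of protected reference texts. The sketch below assumes whitespace tokenization, an 8-word window, and a 0.5 blocking threshold. All three are illustrative choices, not recommended values, and the function names are hypothetical. Production systems typically use more robust matching, but the shape of the check is the same.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(protected_texts: list[str], n: int = 8) -> set[tuple[str, ...]]:
    """Collect every n-gram that appears in any protected text."""
    index: set[tuple[str, ...]] = set()
    for text in protected_texts:
        index |= ngrams(text, n)
    return index


def overlap_ratio(output: str, index: set[tuple[str, ...]], n: int = 8) -> float:
    """Fraction of the output's n-grams that also occur in the protected index."""
    grams = ngrams(output, n)
    if not grams:
        # Output shorter than n words: nothing to compare.
        return 0.0
    return len(grams & index) / len(grams)


def should_block(output: str, index: set[tuple[str, ...]],
                 n: int = 8, threshold: float = 0.5) -> bool:
    """Flag an output whose overlap with protected text meets the threshold."""
    return overlap_ratio(output, index, n) >= threshold
```

A filter like this catches only near-verbatim reproduction. It does nothing about paraphrase or stylistic imitation, so it addresses the reproduction risk described above and not the broader questions about derivative work.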
Support compensation mechanisms, even if you are not legally required to. This is the ethical recommendation, not the legal one. If your model generates value from copyrighted training data, contribute to the ecosystem that produces that data. This might mean licensing agreements, revenue-sharing arrangements, or contributions to creator funds. The specific mechanism matters less than the principle: if you extract value from a resource, contribute to the sustainability of that resource.
The uncomfortable position
Most organizations building AI systems are in an uncomfortable position. They know that their training data includes copyrighted material. They know that the legal framework has not resolved whether this use is permitted. They know that competitive dynamics favor using all available data. And they know that there are legitimate ethical concerns that the law may leave unaddressed even after the legal questions are settled.
The organizations I respect most are the ones that acknowledge this discomfort rather than resolving it prematurely in either direction. They do not pretend that copyright concerns are irrelevant because fair use might apply. They do not refuse to train on copyrighted data because the ethical questions are unresolved. They make deliberate decisions about data sources, implement safeguards, support compensation mechanisms, and maintain the flexibility to adjust their approach as the legal and ethical landscape develops.
The heuristic: if you would not be comfortable explaining your training data practices to the creators whose work is in your dataset, you have a problem. Not a legal problem necessarily, but an ethical one. And ethical problems that go unaddressed have a tendency to become legal problems eventually.