Is AI training on copyrighted data theft or fair use?
Every major AI model was trained on copyrighted material — books, articles, images, music, code — mostly without permission or payment. The New York Times is suing OpenAI. Getty Images sued Stability AI. Authors, artists, and musicians are furious. The outcome of this debate will determine whether AI companies owe trillions in damages or whether copyright law gets rewritten for the AI age.
Where They Stand
Training on copyrighted data is fair use
Sam Altman has argued that training AI on publicly available data is a form of "learning" analogous to how humans learn by reading — and that requiring licenses for training data would make AI development impossible for anyone except the wealthiest companies. OpenAI's legal position in the NYT lawsuit is that training constitutes transformative fair use: the model doesn't store or reproduce the original works, it learns patterns from them. Yann LeCun has been even more direct, arguing on social media that restricting training data would be like "making it illegal for humans to learn from books they've read" and that the entire open-source AI ecosystem depends on broad training data access. Mark Zuckerberg has framed Meta's approach as democratising AI — Llama models trained on broad internet data ensure that AI isn't controlled only by companies wealthy enough to license proprietary datasets. The fair-use camp's strongest argument: search engines also "read" the entire internet without licensing every page, and courts ruled that was transformative. Their weakest: unlike search, AI models can generate outputs that directly compete with the original creators.
It's theft and creators must be compensated
Gary Marcus has been one of the most vocal advocates for creator rights in the AI training debate. He argues that the "fair use" defence is a legal fig leaf that lets trillion-dollar companies profit from the work of millions of creators who never consented and never got paid. He draws a sharp distinction between a human reading a book (who might buy one, recommend it, write a review) and a model ingesting millions of books to create a product that replaces the need to buy books at all. Tristan Harris, through the Center for Humane Technology, frames this as another instance of Big Tech's established pattern: extract value from users and creators first, seek forgiveness later, and rely on the fact that by the time regulation catches up, the industry is too entrenched to unwind. He argues that generative AI is the largest wealth transfer from individual creators to tech corporations in history. Both point to concrete evidence: the NYT lawsuit showed ChatGPT could reproduce near-verbatim passages from paywalled articles, Stability AI's models generate images in the style of specific living artists, and code models reproduce GPL-licensed code without attribution. The creators' camp argues that "transformative use" has a limit, and that limit was crossed when AI outputs directly substitute for the original works.
New licensing frameworks needed
Mustafa Suleyman has argued that a new legal framework is needed — one that recognises AI training as something genuinely new, not neatly fitting into existing copyright categories. He has proposed mechanisms similar to music licensing: collective licensing bodies that negotiate rates on behalf of content creators, with AI companies paying into a pool based on training data usage. Satya Nadella has positioned Microsoft as willing to work with publishers, signing content licensing deals with news organisations and creating opt-out mechanisms for Bing's AI features. Microsoft's approach is pragmatic: settle with major publishers, create enough licensing infrastructure to make the legal risk manageable, and move on. Jensen Huang has framed this from the compute infrastructure side — NVIDIA benefits regardless of who wins, but has advocated for "synthetic data" as a long-term solution: training future models on AI-generated data rather than human-created data, which sidesteps the copyright question entirely. The licensing camp's position: the genie is out of the bottle, retroactive punishment is impractical, so build forward-looking frameworks that compensate creators without killing innovation.
Copyright law itself needs rethinking
Andrej Karpathy has argued that copyright law, designed for an era of printing presses and vinyl records, is fundamentally unequipped to handle AI. He has noted that the concept of "copying" breaks down when a model distils patterns from billions of documents into statistical weights that don't contain any single original work. The model doesn't "have" your book inside it any more than your brain "has" every book you've read. He suggests that new legal concepts are needed that distinguish between memorisation (which models can do and shouldn't) and generalisation (which is the entire point of learning). Emad Mostaque, former CEO of Stability AI, which faced the Getty Images lawsuit head-on, took the most radical position: that copyright in its current form is a pre-internet relic that serves publishers more than creators, and that AI should accelerate the transition toward new creator compensation models (patronage, subscriptions, direct support) rather than propping up a broken licensing system. His argument is that the music industry fought file-sharing for a decade before Spotify proved that access models could pay creators more than the old system — AI will force a similar reckoning for text and images.
Patrick's Take
This debate hits different when you're training Malaysian businesses on AI, because most of my clients are consumers of AI-generated content, not creators whose work was ingested into models. But that doesn't mean you can ignore it. Here's the practical reality I share in every training session: if you're using AI to generate marketing copy, blog posts, product descriptions, or images for your business, you have a copyright exposure you probably haven't thought about. No court has definitively ruled on who owns AI-generated output. If Midjourney generates an image that's suspiciously close to a copyrighted work, and you publish it on your website, you might be the one holding the bag — not Midjourney. What I advise: use AI for first drafts and internal work freely. For anything customer-facing, have a human review and substantially modify the output. Never use AI-generated images that mimic a specific artist's style. And keep records of your prompts and editing process — if you can show substantial human creative input, your position is much stronger. This isn't paranoia; it's the same basic risk management you'd apply to any business tool.
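One lightweight way to keep those records is a simple provenance log, one entry per published piece of AI-assisted content. A minimal sketch in Python — the file name and record fields here are illustrative choices, not any standard:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Illustrative filename; keep it somewhere you back up.
LOG_FILE = Path("ai_content_log.jsonl")

def log_ai_content(tool: str, prompt: str, ai_output: str, human_edits: str) -> dict:
    """Append one provenance record documenting the human creative input."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool,                # e.g. "ChatGPT", "Midjourney"
        "prompt": prompt,            # what you asked the tool for
        "ai_output": ai_output,      # the raw draft, or a reference to it
        "human_edits": human_edits,  # summary of your substantive changes
    }
    # JSON Lines: one record per line, append-only, trivially auditable.
    with LOG_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

The point isn't the format — a spreadsheet works too — it's having a dated trail showing the AI draft and the human modifications that followed it.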
What This Means for Your Business
Three things to do right now. First, create an AI content policy for your business that distinguishes between internal use (low risk — use AI freely for brainstorming, drafts, analysis) and external publication (moderate risk — human review and modification required). Second, if your business creates original content that has commercial value (photography, written guides, training materials, designs), check whether your hosting platforms allow AI scraping and opt out where possible. The NYT lawsuit revealed that even paywalled content was ingested. Third, watch the legal outcomes. The NYT v. OpenAI case, Getty v. Stability AI, and the Authors Guild class action will set precedents that affect every business using generative AI. If courts rule against fair use, AI tool pricing will increase significantly as companies are forced to license training data — factor that into your AI budget projections.
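For the opt-out step, the usual mechanism is your site's robots.txt. A sketch blocking the best-known AI training crawlers — check each vendor's current documentation before relying on these tokens, since new bots appear regularly:

```
# robots.txt — opt out of common AI training crawlers

# OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Google's AI-training opt-out token (doesn't affect Search)
User-agent: Google-Extended
Disallow: /

# Common Crawl, whose archives are widely used as training data
User-agent: CCBot
Disallow: /
```

Note that robots.txt is a request, not an enforcement mechanism: reputable crawlers honour it, but it can't technically prevent scraping.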
What to Actually Worry About
The real business risk isn't getting sued for using AI-generated content — it's building your brand on content that has no clear legal ownership. If you generate 500 blog posts with AI and a court later rules that AI-generated text isn't copyrightable (which the US Copyright Office has signalled), your competitors can legally copy all of it. You have no IP protection for purely AI-generated work. The fix is simple: ensure meaningful human creativity in everything you publish. Edit, add original insights, include your own data and analysis. The content that survives both the legal uncertainty and the SEO quality filters is the content that has genuine human value layered on top of AI efficiency.
Featured Minds in This Debate
Sam Altman
CEO, OpenAI
Yann LeCun
VP & Chief AI Scientist, Meta
Mark Zuckerberg
Founder & CEO, Meta
Gary Marcus
Professor Emeritus of Psychology & Neural Science, New York University
Tristan Harris
Co-Founder & Executive Director, Center for Humane Technology
Mustafa Suleyman
CEO, Microsoft AI
Satya Nadella
Chairman & CEO, Microsoft
Jensen Huang
Founder, President & CEO, NVIDIA
Andrej Karpathy
Independent Researcher & Educator (formerly Tesla, OpenAI)
Emad Mostaque
Former CEO & Founder, Stability AI (resigned 2024)
Last updated: 2026-04-13