AI Training’s Copyright Crossroads

Reading Time : 2 minutes

The rise of generative AI has pushed copyright law into uncharted territory. Training large language models often involves scraping vast amounts of online content—much of which is copyrighted. This practice is now under intense legal scrutiny, and the outcomes could redefine the rules for building AI systems. Courts are beginning to weigh in, and their decisions may fundamentally alter how models are trained and what data can be used.

One of the key legal cases in this space involves Thomson Reuters suing Ross Intelligence. The case centered around Ross’s use of Westlaw headnotes—specially prepared legal summaries—to train a non-generative AI legal search engine. The court ruled this did not fall under fair use, a decision that sent ripples through the AI development world. While Ross wasn’t using generative AI like GPT or Claude, the case still raised questions about how much data use is too much without permission.

Meanwhile, The New York Times has taken legal action against OpenAI and Microsoft, arguing that its articles were used without authorization to train generative models like ChatGPT. This case is particularly significant because it directly addresses generative AI and copyrighted journalism. The court has decided the case can proceed, which means we’re likely to get more clarity in the coming months or years about the legal boundaries of AI training.

For now, there’s no one-size-fits-all answer. Some experts argue that using data for training models can be considered transformative and thus protected under fair use. Others say AI companies will eventually have to strike licensing deals to access high-quality content legally. Until the courts make definitive rulings—or lawmakers step in—AI developers are navigating a gray area that could shape the future of tech and creative work alike.